1 Abstract


A parameterized version of the tree processor has been designed and tested (by simulation). 
The leaf processor design is 90% complete. We expect to complete and test a combination 
of tree and leaf cell design in the next period. Work has been proceeding on algorithms 
for the CAM, and once the design is complete we will begin simulating algorithms for large 
problems. 

In the last 6 months we have produced four publications that describe various components 
of our research. They are summarized below. 

• J. Storrs Hall, Donald E. Smith, and Saul Levy, The Practical Implementation of 
Content Addressable Memory, LCSR-TR-179, Laboratory for Computer Science 
Research, Rutgers University, March 92. This was also submitted to the 1992 Frontiers 
of Massively Parallel Computation Conference. 

LCSR-TR-179 presents a functional description of our CAM architecture and discusses 
attributes (e.g., density, scalability, data-path width, and coupling) that determine the
effectiveness of such architectures. In addition, two examples are presented that demon- 
strate the use of CAM-based algorithms. 

• Donald E. Smith, Keith M. Miyake, and J. Storrs Hall, Design of a LEAF cell for 
the Rutgers CAM Architecture, LCSR-TR-180, Laboratory for Computer Science 
Research, Rutgers University, March 92. 

LCSR-TR-180 presents a specification of the LEAF cell and its interfaces to other 
modules in the CAM architecture. It describes the four communicating processors 
which compose each LEAF cell (i.e., k-bit, 1-bit, I/O, and memory) and their respective
interfaces. 

• Keith M. Miyake, Donald E. Smith, Circuit Design Tool User’s Manual, LCSR- 
TR-181, Laboratory for Computer Science Research, Rutgers University, March 92. 

LCSR-TR-181 describes the design tool we have implemented in support of CAM re- 
search. The design tool is written for a UNIX software environment and supports the 
definition of digital electronic modules, the composition of modules into higher level 
circuits, and event-driven simulation of the resulting circuits. Our tool provides an in- 
terface whose goals include straightforward but flexible primitive module definition and 
circuit composition, efficient simulation, and a debugging environment that facilitates 
design verification and alteration. 

The Rutgers CAM design uses many of the features typical of most design tools; however,
its unique architecture also requires some features that are not widely supported. Our
design makes use of many similar, but not identical, modules, which puts a premium on
design tools that support parameterized modules (e.g., generic entities in VHDL) and
strong typing of a module’s ports. In addition, since our design is continually evolving,
the quality of and control over error handling, in both the design and the simulation
phases, is very important. In looking for a design tool to support our research we found
that either some critical features were inadequately supported or the tool was much more
than we required. In response we implemented a prototype design tool that supports
exactly the features required by the design of the CAM.

• S. Wei and S. Levy, Design and Analysis of Efficient Hierarchical Interconnection
Networks, LCSR-TR-167, Laboratory for Computer Science Research, Rutgers
University, September 91. A shorter version of this paper was published as: S. Wei
and S. Levy, Design and Analysis of Efficient Hierarchical Interconnection
Networks, Proceedings of the 1991 Supercomputing Conference, Nov. 91, pp. 390-399.

LCSR-TR-167 presents a new approach to message-passing architectures based on 
the general idea of hierarchical interconnects. The approach chooses the appropriate 
number of interface nodes and clusters based on performance and cost-effectiveness. 
The report includes both static and queueing analyses of such networks. 
Abstract

The notion of using content addressable memory (CAM) to achieve massively parallel 
processing has resurfaced regularly since it first appeared in the 1960’s, but has consistently 
failed to produce cost-effective general-purpose systems. An analysis of this situation 
reveals a number of specific design pitfalls regarding memory density, scalability, datapath 
width, and processor coupling. Once these are avoided, specific functionalities must be 
included in the design. This paper details the pitfalls and presents an architecture which 
avoids them. Further, guidelines are developed for estimating CAM’s effectiveness as a 
parallel processor. 

1 Supported by DARPA and NASA under NASA-Ames grant NAG 2-668 
Introduction

“Pure” content addressable memory (CAM), such as is used in cache lookup, is a 
method of addressing where each word of memory has a variable address, explicitly stored 
with the word. With every memory access, the desired address is compared with the 
stored address in every word. We are concerned here with an extension of this concept 
which forms the basis for a method of massively parallel processing. It dates back at 
least to Falkoff[62] and has been called many things, including content addressable parallel 
processing (Foster[76]) and associative computing (Potter[88]). We shall use CAM as a 
generic term, to include parallel processing, and will refer to “pure CAM” if it is necessary 
to distinguish the simple form. 

This paradigm is well explained in Foster [76]. The broadcast value, instead of being 
compared strictly with the address portion of the word, is compared with the entire word 
(with a mask to provide “don’t care” bit positions). The result of the comparison, rather 
than immediately causing a read or write of a matching word, is stored in an explicit 
“response bit”. The bit is then used to control subsequent read/write operations; in 
particular, more than one word can be written into simultaneously. Boolean functions are 
then synthesized from sequences of tests, and bit-serial arithmetic can be performed on all
words in parallel.
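
The masked comparison, response bit, and multi-word write described above can be illustrated with a small sequential model. This is only a sketch under stated assumptions: the function names are hypothetical, words are modelled as Python integers, and the list comprehensions stand in for work that the hardware would do in every word at once.

```python
def search(words, value, mask):
    """Compare `value` with every word; mask bits that are 0 are "don't care".

    Returns one response bit per word (True where the masked bits match).
    """
    return [((w ^ value) & mask) == 0 for w in words]


def write_matching(words, response, value, mask):
    """Write the masked bits of `value` into every responding word at once.

    Words whose response bit is False are left unchanged.
    """
    return [
        ((w & ~mask) | (value & mask)) if r else w
        for w, r in zip(words, response)
    ]


def count_responders(response):
    """Feedback to the processor: how many words matched the last search."""
    return sum(response)


# Tag every word whose high nibble is 0x1 by setting its low nibble to 0xF.
memory = [0x12, 0x34, 0x15, 0x36]
hits = search(memory, 0x10, 0xF0)
memory = write_matching(memory, hits, 0x0F, 0x0F)
```

Boolean functions of several fields would be built by chaining such searches and combining the resulting response bits, which is the synthesis step the text refers to.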

At a higher level, the model allows for logic, comparisons, and arithmetic between some 
global value and a local value stored in each word, or between local values in each word. 
Individual words may refrain from the operations based on locally determined conditions. 
This results in an architecture which is equivalent to a SIMD star network, with the CPU 
as the hub and each word as a leaf processor. Since the hardware in the memory is only a 
few gates in addition to a flip-flop at each bit, CAM should, or so the theory goes, form 
the basis for massively parallel processors at densities near those of static RAMs. 

We examine two questions: (a) can CAM (or some mechanism that implements the 
CAM computational model) really be built within a small constant factor in cost of RAM, 
and (b) if so, how efficiently can it be used? 

Typical implementations of CAM 

This section explains how, starting from the basic idea of CAM, a designer might 
ultimately end up with any of a number of existing processor array architectures. This is 
not to be taken to imply that any of those systems were designed that way, nor indeed had 
the CAM model in mind at all. It does, however, fairly reflect the authors’ own attempts
to find a cost-effective realization of the paradigm.

The first problem the CAM designer meets is that CAMs useful for parallel processing
require very wide words; 256 bits is not unreasonable. Furthermore, each bit position
requires a data line and a mask line in the bus, doubling its width. Requiring a 512-bit
wide bus is not impossible, but the CAM also requires a connection between each bit in
any given word, which ordinary memory does not. Thus it is problematic to split CAM
memories onto separate chips, since the pinout of each chip would have to handle the
entire bus width.

What is worse, in the process of most CAM operations, a large portion of the bus is 
wasted. Most comparisons are of smaller fields within a word (selected by the mask lines),
and in the case of the bit-serial, word-parallel arithmetic, only one or two bits are being 
tested or set at a time! Clearly, an optimization can be performed: all the comparators 
at each bit can be removed, and in their place a one-bit ALU can be added to the word. 
The “match” daisy chain and the read/write control lines can be replaced with a one-bit 
local bus running across the word. Now the (global) bus is much smaller: an opcode for 
the ALU, a bit address which can be decoded on-chip, a one-bit data bus. What is more, 
arithmetic is faster; single-bit arithmetic operations are built into the ALU rather than 
being synthesized out of logic operations which are in turn synthesized out of masking and
testing.

At this point there is a strong temptation for the designer to split the architecture 
across chips, having processor chips which connect to standard memory chips. This has 
the advantage of making memory much less expensive, but the disadvantage of tying the 
number of words to the pinout of the processor chip. The relationship to the original CAM 
paradigm begins to become somewhat strained also; this implementation puts a strong 
downward pressure on the number of words/processing elements, and a strong upward 
pressure on PE complexity, inter-PE connectivity, etc. In practice, this has been the 
best tradeoff point for CAM-like implementations; it characterizes the evolutionary niche 
occupied by the CM-1, the DAP, the MPP, and even the Illiac. Of course these machines
were not (necessarily) designed as an implementation of CAM: they are mentioned to 
illustrate the point in the design space toward which CAM tends to gravitate. 

We should mention, as a counterpoint, the STARAN (Batcher [74]), which was de- 
signed to implement CAM ideas. STARAN consisted of a number of 256-word arrays of 
256-bit words. Today a STARAN array could be put on a single chip, but there would 
still be the bus to contend with. Perhaps predictably, the major thrust in CAM-like 
architectures in the interim has been along the processor-array lines.


Criteria 

Given these facts, it behooves the architect of a CAM-based system to develop a very 
strong theory of why the CAM paradigm has failed to produce a cost-effective general- 
purpose architecture. Here is the theory: 

o Density: The basic CAM algorithms are based on the assumption that all of the 
memory in the system is CAM. No implementation to date has come near this. CAM 
has been a scarce commodity, backed up by RAM, and data is swapped in and out 
of CAM to be processed. This is deadly, since CAM at best transforms a linear 
search or other simple loop to a constant-time operation; swapping the data in or out 
re-introduces the linear time. 

o Scalability: For physical practicality, CAM chips must have constant pinout, indepen- 
dent of the number of words per chip. The size of ordinary RAM is an extraordinarily 
scalable feature of the von Neumann architecture, varying by more than six orders 
of magnitude over the range of different systems. If CAM is not also scalable in this 
very strict sense, it will fail to substitute for RAM. 

o Datapath width: In choosing a one-bit processor, a CAM implementation gains faster
arithmetic but loses content addressability. If comparisons are bit serial, CAM is
slower than conventional indexing structures such as binary trees and hash tables. To 
be usable as CAM, a memory must be able to compare, add, and subtract in a time 
comparable to a normal memory access. 

o Coupling: The overwhelming tendency in designing a SIMD processor array is to 
place it as an attached processor to some conventional machine which manages the 
data not being used in the current parallel computation. This is unworkable; the 
bottleneck means that unless there are large speedups to be gained from operations on 
relatively long-lived data, the serial host processor can perform most simple associative 
operations faster than the CAM, when the time to transfer the data to and from the 
attached processor is taken into account. 

Density 

Perhaps the most important criterion is density. The CAM must be usable as mem- 
ory. If this is met, the basic CAM algorithms can be brought into play. Associative 
retrieval allows arrays of simple, explicitly-indexed records to be used instead of trees, 
linked lists, hash tables, priority queues, and inverted indices; what is more, it saves the 
software designer from having to make choices, with their associated tradeoffs, between 
these structures. (See Hall[81], Potter[88].) 

It would be extremely inefficient to use a “true” parallel processor for search operations.
The algorithmic speedup is at best log N, so its efficiency is, e.g., 0.002% in the
case of a million-word CAM. Luckily, as distinct from “pure” CAM, the parallel processing
CAM model produces significantly better speedups for other operations. If a million-word
CAM has the same hardware cost as a fully-connected parallel processor with a thousand
nodes, the CAM need only achieve an average utilization of 0.1% to equal, in operations
per cycle, the true parallel processor with perfect 100% linear speedup. (Caveat: A
whole-chip processor will almost certainly have faster cycles than a CAM.)
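
The two percentages in this paragraph follow from simple ratios. The sketch below reproduces them; the function names are hypothetical, and taking the logarithm to base 2 is an assumption (the text writes only "log N").

```python
import math


def search_efficiency(n_words):
    """Efficiency of N processors whose best speedup on search is log2(N)."""
    return math.log2(n_words) / n_words


def break_even_utilization(cam_cells, parallel_nodes):
    """Fraction of CAM cells that must do useful work each cycle to match
    `parallel_nodes` processors running at perfect linear speedup."""
    return parallel_nodes / cam_cells


# Million-word CAM used only for search: about 0.002% efficiency.
search_pct = 100 * search_efficiency(1_000_000)

# Million-cell CAM against a thousand-node machine: 0.1% utilization.
break_even_pct = 100 * break_even_utilization(1_000_000, 1_000)
```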

Even so, we would like to keep the CAM to within some small constant factor of RAM 
cost. While a somewhat speculative analysis indicates that CAM might be rated as having 
a processing power proportional to the square root of the number of words, it remains the 
case that in order to do so it must act as the system’s primary memory. Thus, its cost must 
remain low independent of the processing power it provides. This can be accomplished by 
starting with a high-density DRAM design and only allowing some constant fraction of 
the chip to be used for the active elements that implement the CAM model. 


Scalability 

The only interconnection schemes that meet the strict scalability criterion are a bus,
a linear array (daisy chain), and a tree. We find that a bus is desirable to distribute
instructions to the words; a tree is essential for implementing the collective functions the
CAM computational model requires; and a long shift register is probably the best way
to implement an asynchronous I/O capability that does not seriously interfere with other
operations.
Width

Consider the following design task: you have 1024 full adders and desire to build a
machine to compute 1024 evenly spaced points of a linear function y = mx + b in 32-bit
fields. You can add whatever control, memory, and communications you like. There are
three strategies:

o First, you could make a serial processor using all the hardware to form a circuit that
could multiply in one cycle. Then loop computing mx + b at each iteration, taking
2048 cycles. This can be improved by a strength reduction optimization, starting with
b and adding m at each iteration, for a total of only 1024 cycles.

o Second, use 1024 1-bit ALUs to compute the result directly, computing mi in each
processor (where i is the processor number stored as a constant) and adding b, taking
1024 + 32 = 1056 cycles. This is better than the naive serial algorithm but comparable
to the optimized one.

o Third, one can form 32 32-bit ALUs and take advantage of parallelism and strength
reduction: in a 32-cycle multiply (and a 1-cycle add), form the number 32im + b at
each processor (where 32i has been stored as a constant) and use that as a starting
point for adding m for the next 31 points. This gives you a total of 64 cycles.

This example is illustrative of a class of algorithms, which we call semi-serial algo- 
rithms, for which ALUs of a width commensurate with the data of interest and capable 
of simple arithmetic and comparison, but not more, form a local optimum in hardware 
efficiency. Note that for large problems using a fixed algorithm, the three designs above
are equivalent: each can do 1024 multiplies in 1024 cycles. 
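
The cycle counts quoted for the three strategies can be checked with a small simulation. The sketch below is illustrative only: the function names are hypothetical, 32-bit overflow is ignored, and the cost model is an assumption read off the figures in the text (one cycle per wide add, 32 cycles for a 32-bit multiply on a 32-bit ALU, and 32 x 32 cycles for a bit-serial multiply followed by a 32-cycle bit-serial add).

```python
N = 1024  # number of points, and number of full adders available
W = 32    # field width in bits


def serial_strength_reduced(m, b):
    """One wide ALU: start at b and add m once per cycle."""
    ys, y, cycles = [], b, 0
    for _ in range(N):
        ys.append(y)
        y += m
        cycles += 1
    return ys, cycles


def bit_serial(m, b):
    """1024 one-bit ALUs: processor i computes m*i + b bit-serially."""
    ys = [m * i + b for i in range(N)]  # every processor runs in lockstep
    cycles = W * W + W                  # bit-serial multiply, then add
    return ys, cycles


def semi_serial(m, b):
    """32 ALUs of 32 bits: seed 32*i*m + b, then add m for the next 31 points."""
    n_alus = N // W
    ys = [0] * N
    current = [W * i * m + b for i in range(n_alus)]
    cycles = W + 1                      # 32-cycle multiply plus 1-cycle add
    for step in range(W):
        for i in range(n_alus):
            ys[W * i + step] = current[i]
        if step < W - 1:
            current = [c + m for c in current]  # one parallel add
            cycles += 1
    return ys, cycles
```

All three functions return the same 1024 values; only the cycle counts differ (1024, 1056, and 64 respectively under this cost model).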

Coupling 

Having attained memory-like density in the CAM, we can use it to advantage by the 
simple expedient of replacing all the system’s RAM with CAM. (An alternative approach 
would be to use a Harvard-like architecture with RAM for instruction memory and CAM 
for data memory.) It is a long-established trend for processors to be faster than memory 
and to run asynchronously. In practice, this may mean that special processors are not 
necessary for CAM-based systems (although it is certainly possible to design them). 

In a RAM, where only one word is being accessed at a time, clock skew across the
memory is not a serious problem. In a CAM, it might be. If the connectivity is a tree,
outgoing information, i.e. from the CPU to the CAM, may arrive at the memory in a skewed
fashion harmlessly, since no leaf depends on information from any other leaf. However,
ingoing information may need to be synchronized. If the ingoing datapaths are combinational,
synchronization consists only of waiting the longest number of gate delays from
CAM to CPU. If the delay is consistent, this is not only simpler but faster than any other
method.

Other Features 

To be effective as CAM, a system must not only avoid these pitfalls but be designed
with a cognizance of typical CAM algorithms in order to make most efficient use of its
hardware. The following is a list of features we have found to confer substantial algorithmic
advantages, while remaining implementable within the constraints above:

o Collective functions: The CAM model is virtually useless without a fairly powerful 
feedback mechanism from the memory to the processor. After an associative search, 
the processor may need to know how many (if any) matches were found. In the classical 
CAM model, operations such as summing all active words, finding the maximum 
or minimum of a set of values, and the like can be done as word-parallel bit-serial 
algorithms. These operations are crucial parts of the basic CAM computational model.

o Segmentation: The CAM model has the capability of doing in parallel essentially a
simple loop, dealing only with local and global values at each point. The ability to
segment the CAM, still doing the same operation everywhere but having a different
“global” value in each segment, corresponds to doing nested loops, and extends the
range of parallel operations significantly.

o Local addressing: The fields of the CAM words which are going to be operated on by 
an instruction are, in the basic model, the same for each word. The ability to vary 
the field choice on the basis of local data allows the CAM to do things like regular 
expression matching or unification in parallel, a prerequisite to the extension of the 
ideas of content addressability into higher-level models of computation. 


The Rutgers CAM 

At this point we shift terminological gears; the specifics of the model are sufficiently
complex that the concept of a “word” splits into two separate terms: a “cell” is the locus of
one active element and the unit of activity control; the term “word” hereinafter will mean
an addressable unit of simple memory whose width is that of the bus and other datapaths.
There are many words per cell.

The design criteria elucidated in the preceding sections interact strongly with par- 
ticular states of technology to determine the viability of a CAM implementation. As a 
point of departure, let us assign values to the constants as follows: Word width, 32 bits; 
density, 1K words per ALU. With these parameters we can devote at least half the silicon
to DRAM; the density of the CAM as memory will be at least half that of conventional 
memory in the same technology. 

A decade ago, the dominant memory technology was 64 K-bit DRAMs. The above
constraints would specify a 32 K-bit chip, with one ALU occupying half the space and 1K
32-bit words on the other half, for a total of one CAM cell per chip. A one-megabyte memory
(typical for mainframes of the day) would have consisted of 256 chips (and therefore 256
CAM cells). Assuming the CAM could be driven at 5 MHz and do one operation every 5
cycles, the CAM would have represented a 256 mega-ops peak processing rate.

For the mid-90’s we can use a 64-Meg DRAM as a basis and obtain 1K CAM cells per
chip. (Each cell is still an ALU and 1K 32-bit words; the memory still occupies half the
chip.) 16 chips would provide 64 megabytes of memory and, assuming a 10-MHz operations
rate (from a 50-MHz system clock), a peak 160 giga-ops. This is a single-board computer.
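
The two design points above are straightforward arithmetic on the chosen constants. The sketch below recomputes them; the function and key names are hypothetical, and it assumes binary prefixes (64 K-bit = 64 x 1024 bits, 64-Meg = 64 x 2^20 bits) and that exactly half of each chip holds memory.

```python
def cam_design_point(dram_bits_per_chip, chips, ops_per_second,
                     word_bits=32, words_per_cell=1024):
    """Cells, memory size, and peak rate for a CAM built from DRAM-class chips."""
    memory_bits = dram_bits_per_chip // 2      # other half is active elements
    cell_bits = word_bits * words_per_cell     # 1K words of 32 bits per cell
    cells_per_chip = memory_bits // cell_bits
    cells = cells_per_chip * chips
    return {
        "cells_per_chip": cells_per_chip,
        "cells": cells,
        "memory_bytes": memory_bits * chips // 8,
        "peak_ops_per_second": cells * ops_per_second,
    }


# 64 K-bit technology: 5 MHz clock, one operation every 5 cycles.
early = cam_design_point(64 * 1024, chips=256, ops_per_second=5_000_000 // 5)

# 64-Meg technology: 10 MHz operation rate from a 50 MHz system clock.
mid90s = cam_design_point(64 * 2**20, chips=16, ops_per_second=10_000_000)
```

Under these assumptions the second configuration has 16K cells, so its exact peak is 163.84 x 10^9 operations per second, which the text rounds to 160 giga-ops.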

We have developed a “CAM virtual machine” as a common focus for architectural 
and algorithmic efforts. This model of the CAM reflects the capabilities of active elements 
which fit the tradeoffs above, i.e. all the circuitry associated with one CAM cell must be
roughly the size of 1K words of DRAM.

The CAM model is related to Blelloch’s[87] “scan model of computation”, but differs 
in the fundamental regard that the CAM model does not have a general permutation 
operator and thus is not a complete model for parallel computation. It does include: 
o Activity control: all the following can be controlled on a per-cell per-instruction basis.
This forms the major difference between CAM and simple vector styles of computation.

o Parallel vector operations to include addition, subtraction, comparison, bitwise boolean
functions, and some shifting and byte extraction. These operations can only be
done between words of the same cell, but they need not be the same words in each
cell. It does not include multiplication, division, or floating point, although these can
be done in software.

o Broadcast: one of the operands in the above operations can be a “global” constant
value (the same in each cell).

o Collective functions: scalar-valued collective functions include the sum, max, and min
of all the elements of a vector; vector-valued collective functions are parallel prefix (and
suffix) forms of the scalar ones; and skip-shifting. This last moves a value from each
active cell to the next active cell, no matter how far away, as a unit-time primitive.

o Segmentation: All of the collective functions and broadcasting can be done in segments,
which, like the activity, are definable on the fly. Each segment can have a
different “global” value which comes from some cell in the segment.

o Simple one-cell-at-a-time shifting which ignores activity and segment definitions can
be done concurrently with other CAM operations; such shifting requires time proportional
to distance shifted.
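
The meaning of the collective functions listed above can be pinned down with a sequential reference model. This is a sketch, not the hardware algorithm: the function names are hypothetical, the scan is assumed to be inclusive, inactive cells are reported as None to show that they are not written, and the skip-shift shown here ignores segment boundaries for brevity.

```python
def segmented_plus_scan(values, active, seg_start):
    """Inclusive prefix sum over active cells, restarting at each segment start."""
    out, running = [], 0
    for v, a, s in zip(values, active, seg_start):
        if s:
            running = 0          # a new segment begins at this cell
        if a:
            running += v
        out.append(running if a else None)
    return out


def skip_shift(values, active, fill=0):
    """Each active cell receives the value held by the previous active cell.

    The first active cell has no predecessor and receives `fill`.
    """
    out, carried = [], fill
    for v, a in zip(values, active):
        if a:
            out.append(carried)
            carried = v
        else:
            out.append(None)
    return out


def collective_sum(values, active):
    """Scalar-valued collective function: sum over the active cells."""
    return sum(v for v, a in zip(values, active) if a)
```

In the CAM these results are produced by the tree in a fixed number of steps; the loops here only define what the answer should be.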

The physical implementation of the CAM model is, as indicated, by way of a set of 
active elements along with DRAM. Each chip, regardless of the amount of CAM onboard, 
has 4 busses (making it more like a processor chip in its packaging). These are one 
bidirectional data bus, one input-only instruction bus, and two I/O busses, one in and 
one out. 64 pins dedicated to I/O sound extravagant, but in the system as a whole, they 
are the most heavily used part. Furthermore, this interface is constant; CAM is a scalable 
architecture in the strongest sense of the word.

o Each CAM cell consists of an ALU with 16 registers, and 1K (or more) words of DRAM. The
ALU, registers, memory, and all datapaths are 32 bits wide. The first 4 registers are
mapped into the tree, the memory, the shifter, and the collection of one-bit registers 
that are the status, activity, segment, and so forth; the rest of the registers are general 
purpose. Each cell is like a very simple RISC with register-to-register operations and 
asynchronous load/store. 

o The cells form the leaves of a tree of simpler ALU’s, each of which has one register. 
The tree is combinational: that is, each CAM cell presents it with a 32-bit value 
and two control bits, activity and segment. The tree forms a direct-wired circuit that 
produces the appropriate value at the root and into each tree node’s latch. This allows 
for virtually any possible clock skew between CAM cells; of course, we pay for this by
having tree operations take 5 to 10 times as long as local CAM operations. 

In a scan or shift operation, the tree actually does two operations, one up and one
down. Each phase is combinational internally.

o A (unidirectional) instruction bus, emanating from the CPU, which controls the cells
and the tree nodes. Depending on chip size and process parameters, the bus may 
be pipelined: the bus is optimized for throughput, in contrast to the tree, which is 
optimized for latency. 

o A shift register for overlapped I/O and data motion between CAM cells. Like the 
tree and the DRAM, the shift register operates asynchronously from the CAM cell. 
CAM efficiency is very dependent on its ability to move data. If a loader (see below) 
can relocate a program in, say, 1000 cycles but then requires a million cycles to move 
it from CAM to instruction RAM, the CAM is worthless. This is one of the reasons 
that our architecture has 1K or more words in each cell: loading and unloading of
one problem’s data while another problem is being worked on is crucial to CAM’s
efficiency. Indeed our design calls for a separate datapath for this function. This can
be something as simple as a (32-bit wide) shift register with one position for each CAM
cell. It doesn’t even have to be true DMA: in our mid-90’s model, for example, the
I/O shift register would clock data for 1024 cycles before needing one memory cycle 
to store it. 

CAM Algorithms 

We present the following algorithms, in a very high-level form, to give a feeling for 
both the abilities and the limitations of the CAM. 

Consider a very commonly used program, the relocating linking loader. Its initial 
task, finding the appropriate place in memory to put each of a given set of modules, can 
be as simple as a single parallel prefix sum. Updating the relative addresses at each point 
of the code to absolute addresses for execution is a local operation in each cell. Resolving 
global references, however, depends on the number of distinct symbols referenced (not the 
total number of occurrences). This dominates the rest of the process, which is constant
time. 
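
The three phases of the loader can be sketched sequentially as follows. All names are hypothetical and the data layout (one reference per cell, a dictionary as the symbol table) is an assumption made for illustration; the point is that placement is one prefix sum, relocation is purely local, and symbol resolution takes one broadcast per distinct symbol.

```python
from itertools import accumulate


def place_modules(sizes, base=0):
    """Exclusive prefix sum of module sizes gives each module's load address."""
    return [base + offset for offset in list(accumulate(sizes, initial=0))[:-1]]


def relocate(relative_addrs, module_of, load_addr):
    """Local step in each cell: absolute = load address of its module + relative."""
    return [load_addr[m] + r for r, m in zip(relative_addrs, module_of)]


def resolve_globals(references, symbol_table):
    """One broadcast per distinct symbol; all matching cells are patched at once.

    Returns the resolved addresses and the number of broadcasts performed.
    """
    resolved = [None] * len(references)
    broadcasts = 0
    for symbol in sorted(set(references)):
        addr = symbol_table[symbol]
        # In the CAM this comprehension is a single associative search and write.
        resolved = [addr if ref == symbol else old
                    for ref, old in zip(references, resolved)]
        broadcasts += 1
    return resolved, broadcasts
```

For five references to three distinct symbols, `resolve_globals` performs three broadcasts, not five, which is the dependence on distinct symbols noted above.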

For the next algorithm, we will assume that we have a “small” CAM, on the order of 
1000 cells; it is intended to be representative of a simulation and visualization task on a 
machine at the scale of a workstation. We wish to simulate a number of bouncing particles 
in some three-dimensional space (at the appropriate scale, molecules in a gas) and display 
the results on a screen in real time. We will assume that there are enough CAM cells to 
allocate one per particle, and (separately, not in addition) one per pixel for one scan line. 

1. [Advance the particles] X_new = X_old + V Δt for each particle. A purely parallel, local
operation.

2. [Find collisions] Naively, this is an O(N^2) sequential time operation, but if the number
of particles is large enough, a sophisticated implementation would use spatially-oriented
indexing schemes to reduce the complexity to N log N (e.g. octrees). CAM
gives us linear time with the naive algorithm, and for the parameters given, that is 
sufficient. (For larger problems, similar indexing schemes could allow enough extra 
use of parallelism to reduce the CAM time to N ? ). 

3. [Simulate collisions] Once the data for each collision has been brought together, the 
new velocities for each particle can be computed in a single parallel step.

4. [Display] For each scan line, perform the following steps: 

5. [Select objects] Associatively search for each object whose image intersects the current 
scan line. For each object from farthest to nearest, do step 6: 

6. [Draw] Set the value of each pixel the current object intersects on the current scan 
line. This step could be split into a sequential and a parallel part like steps 2 and 3 if 
the rendering algorithm is complex enough to warrant it. 
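
Step 2 is where the CAM turns the naive quadratic search into a linear one: each particle's position is broadcast once, and every cell tests itself against it in the same step. The sketch below models that loop sequentially. The names are hypothetical, and the collision test is an assumed axis-aligned proximity check, chosen because it needs only subtraction and comparison; the source does not specify the test used.

```python
def find_collisions(positions, radius):
    """Return candidate colliding pairs and the number of broadcasts used.

    One broadcast per particle; the inner comprehension stands in for a
    comparison that every CAM cell would perform simultaneously.
    """
    limit = 2 * radius
    pairs = []
    broadcasts = 0
    for i, p in enumerate(positions):
        broadcasts += 1
        response = [
            j > i and all(abs(a - b) <= limit for a, b in zip(p, q))
            for j, q in enumerate(positions)
        ]
        pairs.extend((i, j) for j, hit in enumerate(response) if hit)
    return pairs, broadcasts
```

The number of broadcasts equals the number of particles, which corresponds to the linear CAM time claimed for the naive algorithm.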

This algorithm exemplifies the cases where conventional worst-case asymptotic complexity
analysis is inadequate. Constant factors deriving from the necessity of using sophisticated
indexing data structures prevent a real-time implementation on a sequential
machine at the same range of clock speeds as a CAM.

CAM also has operations with high constant factors, notably numerical calculations.
In many cases, fairly simple algorithmic techniques can move these operations out of loops
and do them in parallel.

The final algorithm is intended to demonstrate what could be done with a large 
CAM, on the order of a million cells. (This would represent 4 gigabytes of RAM.) This 
time the task is a low-level part of an image-understanding process, namely to divide a 
picture up into regions. (E.g., suppose we had a black and white picture of a collection 
of polka-dots. We would want to associate with each white pixel a region number of 0, 
meaning that all the background was a single connected region, and with each black pixel 
a region number indicating which polka-dot it was in.) 

1. [Process scan lines] Assuming the picture is in row-major form, use segmentation to
do each scan line in parallel. Do short-distance shifts to localize horizontal neighbor
information, and create a vector with a 1 for each edge, 0 elsewhere. Do a plus-scan
of this vector; the result is a unique region number within each scan line, with each
pixel in the region having a copy of the number.

2. [Process one vertical line] For some vertical line, select (with activity control) only
those pixels in that line. Perform step 1 for that line (using skip-shifting, etc.). Returning
to line-by-line segmentation, combine vertical and horizontal region numbers
into a global region number. This finds all pixels co-regional in horizontally contiguous
segments touching the chosen vertical line.

3. [Other vertical lines] Which and how many vertical lines need to be processed is a 
matter of heuristic. This is considerably assisted by associative search, which can be 
used to skip vertical lines all of whose horizontal segments have been processed by 
the action of other vertical lines. A bisection method works well. In the best case the 
number of vertical lines needed will be on the order of the square root of the number 
of regions. In the worst case it may be the width of the picture, i.e. having to do 
every vertical line. 
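
Step 1 of this algorithm, the edge vector followed by a plus-scan, can be written out for a single scan line as below. This is a sequential sketch with hypothetical names; in the CAM every row would be handled at the same time as a separate segment, and the neighbor comparison would use a one-cell shift.

```python
def label_scan_line(pixels):
    """Number the horizontal runs of equal pixels in one scan line.

    An edge is marked wherever a pixel differs from its left neighbor;
    the inclusive plus-scan of the edge vector is the per-line region number.
    """
    if not pixels:
        return []
    edges = [0] + [int(a != b) for a, b in zip(pixels, pixels[1:])]
    labels, running = [], 0
    for e in edges:
        running += e
        labels.append(running)
    return labels


def label_rows(image):
    """Apply the per-line labelling to every row (one segment per row)."""
    return [label_scan_line(row) for row in image]
```

These per-line numbers are only unique within a row; steps 2 and 3 are what merge them into global region numbers.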

If the CAM were, e.g., a mesh-of-trees instead of simply a tree, we could run step 1
vertically as well as horizontally and be done in two steps.¹ Special-purpose architectures
for vision invariably have such 2-dimensional connectivity. We consider this algorithm to
show that the CAM handles this problem as well as a general-purpose machine might be
expected to. Even when heuristics fail, its performance degrades only to √N.

¹ Actually, a rigorous definition of the algorithmic task can make it arbitrarily complex
for either architecture, involving long chains of region coalescence fixups. However, as
a basis for a low-level input-processing step for vision, identification of relatively simple
regions seems adequate.

Evaluating the Processing Power of CAM 

We must stress that it is virtually meaningless to compare the peak ops rate in CAM 
to serial MIPS. The relationship is extremely problem-dependent, and within a given
algorithm, data-dependent, as is clearly shown in the preceding algorithms.

Even more important, perhaps, is that CAM cannot be validly compared with typical
parallel processors, with their general interprocessor communications ability. The
mid-90's technology estimates above provide a 1K-cell CAM on a chip with roughly 128 data
pins. These pins would only provide 1-bit-wide datapaths if a mesh architecture were
used (i.e., the perimeter of a 32x32 square); a hypercube architecture would be completely
impractical.
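The pin-count argument can be checked with a little arithmetic. The sketch below assumes the 1K-cell chip is laid out as a 32x32 square and, for the hypercube case, a hypothetical million-cell machine built from such chips; the function names are illustrative only.

```python
import math

def mesh_boundary_links(side):
    """Cell-to-cell links that cross the edge of a side x side block of a
    2-D mesh (one link per cell along each of the four sides)."""
    return 4 * side

def hypercube_offchip_links(total_cells, cells_per_chip):
    """Links leaving one chip when a hypercube of total_cells nodes is
    packed cells_per_chip to a chip (both assumed to be powers of two)."""
    total_dims = int(math.log2(total_cells))
    onchip_dims = int(math.log2(cells_per_chip))
    # Every cell has one link per dimension; only the dimensions that
    # fit inside the chip stay on-chip.
    return cells_per_chip * (total_dims - onchip_dims)

pins = 128
mesh_links = mesh_boundary_links(32)                 # 128 links
bits_per_mesh_link = pins // mesh_links              # 1 bit per link
cube_links = hypercube_offchip_links(2**20, 2**10)   # 10240 links
print(mesh_links, bits_per_mesh_link, cube_links)
```

Under these assumptions the mesh uses every pin for a 1-bit link, while the hypercube would need roughly eighty times more off-chip links than there are pins.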

A more valid basis for comparing CAM to other architectures is perhaps by pin count. 
In this scheme one CAM chip might be considered equivalent to a processor chip or 8 
DRAM chips. Thus it could be appropriate to compare the processing power of a
million-cell CAM to a thousand-processor conventional machine, since they would represent about
the same amount of hardware. 

But ultimately CAM shouldn’t be compared to conventional parallel machines built 
as networks of microprocessors, because the architectures are orthogonal. Each processor 
of such a conventional parallel machine could be provided with CAM instead of RAM. 
Such an arrangement combines the best advantages of each part, synergistically. However, 
it is beyond the scope of this paper. 

Summary and Conclusion 

Content Addressable Memory is memory. Our criteria of density, scalability, width, 
and coupling mark the boundary between a powerful memory and an anemic parallel 
processor. Within these limits, CAM can be a very cost-effective system component if 
average effective utilization is near even one percent. 

CAM algorithms can be both simpler and faster than conventional ones; with more in- 
genuity, substantially better utilization can be achieved. The Rutgers CAM design provides 
collective functions and segmentation, allowing fairly sophisticated parallel algorithms in 
an architecture which still retains half the density of DRAM in a given technology. 

Acknowledgements

The authors gratefully acknowledge the substantial contributions of the Rutgers
CAM Project staff, Keith Miyake and Sizheng Wei.









March 92 


DESIGN OF A LEAF CELL
FOR THE RUTGERS
CAM ARCHITECTURE*


Donald E. Smith, Keith M. Miyake, 
and J. Storrs Hall 

LCSR-TR-180 


Laboratory for Computer Science Research 
Hill Center for the Mathematical Sciences
Busch Campus, Rutgers University 
New Brunswick, New Jersey 08903 


* This work was supported by the Defense Advanced Research Projects Agency
and the National Aeronautics and Space Administration under NASA-Ames
Research Center grant NAG 2-668.



1 Hardware Design 


Our Mar91-Aug91 progress report described the Rutgers CAM architecture as a collection
tree sitting over a set of LEAF cells, each with its own memory. Figure 1 shows this
architecture and identifies the two cell types used in its implementation: LEAF cells, which
are composed of a processor and associated memory, and TREE cells, which constitute the
Collection Tree. These two cell types serve complementary functions within the architecture:
TREE cells provide global processing for data collection, data movement, and parallel
prefix (scan) operations over the LEAF cells, while LEAF cells implement CAM-like operations
and support local SIMD processing.

Figure 1: Collection Tree sitting over a set of LEAF Cells

A Register Transfer Level (RTL) description of the tree cell has been completed and
tested. It supports both right-to-left and left-to-right integer scan operations,
activity-controlled and segmented operations, and extended integer precision, as well as all
tree operations described in our Mar91-Aug91 progress report.

The LEAF cell specification is nearly complete and the first draft of an RTL design is 
nearing completion. In the next six months, we expect to complete the LEAF cell design, 
interface it with our TREE cell design, and test these using the shortest path algorithm 
described in our Sep90-Feb91 progress report. 

1.1 LEAF Cell Specification


Each LEAF cell is responsible for performing CAM-like operations and is composed of a
main processor connected to three support processors via multi-ported interface registers.
These support processors are the tree processor, memory processor, and IO processor (not
shown in Figure 1). The LEAF cell's main processor is composed of two communicating
components: a k-bit processor¹ for standard operations and a 1-bit processor used to control
activity within a LEAF cell. Figure 2 shows the interconnection of these components.


1.2 Component Descriptions 

The k-bit processor is a three-bus (i.e., GB0, GB1, GB2) architecture composed of:

• a k-bit ALU

• five general purpose k-bit registers GR1, GR2, GR3, GR4, and GR5

• a dual ported k-bit flag register (FR) directly coupled to the fourteen 1-bit registers

• a dual ported (k+1)-bit tree register (TR) that provides the interface between the leaf and
tree processors

• a tri-ported k-bit refresh register (RR) that provides the interface between the
leaf, memory, and IO processors

• a dual ported IO register (IOR) that provides the interface between the IO processor
and the refresh register (RR)

The 1-bit processor is a four-bus (i.e., BB0, BB1, BB2, AC) architecture composed of:

• a 1-bit ALU

• five general purpose 1-bit registers BR1, BR2, BR3, BR4, and BR5

• a dual ported 1-bit segment register (SEG), read by the tree processor, that indicates
if a leaf processor is (SEG=1) or is not (SEG=0) the first element in a new segment

• a dual ported 1-bit status register (VLD↑), read by the tree processor, that indicates
if the tree processor should use (VLD↑=1) or ignore (VLD↑=0) the data in the tree
register

• five 1-bit status registers (OVERFLOW, ALLZERO, CARRY, SIGN, LOB) that contain
the status of the k-bit ALU

• a dual ported 1-bit status register (VLD↓), written by the tree processor, that indicates
if the leaf processor should use (VLD↓=1) or ignore (VLD↓=0) the data in the tree
register

• a 1-bit constant (ONE) used to activate LEAF cells

Figure 2: LEAF cell architecture

¹ We expect the k-bit processor and its associated registers to be 32 bits wide; however, our
design is not restricted to 32-bit widths but is parameterized as a function of word width.

1.3 Interfaces between the 1-bit and k-bit components 

The 1-bit and k-bit components communicate through the processor status registers (i.e., the
five 1-bit registers that maintain the status of the k-bit processor), through activity control,
and through dedicated connections joining the fourteen 1-bit registers and the low-order 14
bits of k-bit register FR.

Activity control is determined by the value carried on AC in the 1-bit processor and is
used to enable or disable the writing of both 1-bit and k-bit registers from their respective
output buses (i.e., GB2 and BB2). The value placed on AC can be obtained from the output
of the 1-bit ALU (BB2) or from one of the 1-bit registers BR1, BR2, BR3, BR4, BR5,
ALLZERO, or ONE.

The dedicated connections between the 1-bit registers and the low-order 14 bits of FR
provide an additional interface that increases the bandwidth between the 1-bit and k-bit
components. These connections are used by the following operations:

• copy all 1-bit registers into FR0 through FR13

• copy FR[0:4] into BR1 through BR5

• copy FR5 and FR6 into SEG and VLD↑, respectively
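As an illustration of the dedicated FR connections, the sketch below models the fourteen 1-bit registers as a Python dict and FR as an integer. Only the positions of BR1 through BR5 (FR0 to FR4), SEG (FR5), and VLD↑ (FR6) are stated above; the ordering assumed for FR7 through FR13 is hypothetical, as are all function names.

```python
# Bit positions 0-6 follow the text; positions 7-13 are an assumed ordering.
FR_LAYOUT = ["BR1", "BR2", "BR3", "BR4", "BR5", "SEG", "VLD_UP",
             "OVERFLOW", "ALLZERO", "CARRY", "SIGN", "LOB",
             "VLD_DOWN", "ONE"]

def bits_to_fr(bits, fr):
    """Copy all fourteen 1-bit registers into FR0..FR13, leaving the
    higher-order bits of FR unchanged."""
    fr &= ~((1 << len(FR_LAYOUT)) - 1)
    for pos, name in enumerate(FR_LAYOUT):
        fr |= (bits[name] & 1) << pos
    return fr

def fr_to_general(fr, bits):
    """Copy FR[0:4] into BR1..BR5."""
    for pos in range(5):
        bits[FR_LAYOUT[pos]] = (fr >> pos) & 1

def fr_to_seg_vld(fr, bits):
    """Copy FR5 and FR6 into SEG and VLD_UP, respectively."""
    bits["SEG"] = (fr >> 5) & 1
    bits["VLD_UP"] = (fr >> 6) & 1

bits = {name: 0 for name in FR_LAYOUT}
bits["ONE"] = 1
bits["BR2"] = 1
fr = bits_to_fr(bits, 0xFFFF0000)   # upper bits of FR are preserved
```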

1.4 Instruction Fields 

Operation of the LEAF cells is specified by the fields summarized in Figure 3. The estimated
width of each field is shown in parentheses. These specifications are tentative; refinements to
field size and instructions will be made as algorithms are implemented on this architecture.

GOP(4): encoded operation to be performed by the k-bit ALU
R_GB0(3): encoded designation of the source register driving bus GB0
R_GB1(3): encoded designation of the source register driving bus GB1
R_GB2(3): encoded designation of the destination register reading bus GB2
RRC(2): refresh register control
ADDR(lg(n)): memory address
WR(1): memory processor write control
IOC(1): IO register control
OVR(1): enable/disable control of the k-bit ALU's override capability
BOP(4): encoded operation to be performed by the 1-bit ALU
R_BB0(3): encoded designation of the source register driving bus BB0
R_BB1(3): encoded designation of the source register driving bus BB1
R_BB2(3): encoded designation of the destination register reading bus BB2
R_AC(3): encoded designation of the source driving the activity control line AC

Figure 3: Instruction Fields


GOP

GOP encodes the operation to be performed by the k-bit ALU. The exact set of operations
has not been determined but includes AND, OR, XOR, addition, and subtraction as well as
sel-arg1 and sel-arg2. These last two instructions select the indicated input argument and
route it to the output.

R_GB0 and R_GB1

R_GB0 and R_GB1 encode the refresh register, one of the k-bit general registers, the tree register,
or the flag register. Our initial design using 5 general registers is encoded as follows:

000:RR 001:GR1 010:GR2 011:GR3 100:GR4 101:GR5 110:TR 111:FR
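To make the field widths concrete, the following sketch packs the Figure 3 fields into a single integer and unpacks them again. The field widths and the source-register encoding come from the text; the order of the fields within the word, the helper names, and the opcode value used for GOP are assumptions made only for this example.

```python
import math

def leaf_fields(n_words):
    """(name, width) pairs from Figure 3; ADDR is lg(n) bits wide."""
    addr_bits = max(1, math.ceil(math.log2(n_words)))
    return [("GOP", 4), ("R_GB0", 3), ("R_GB1", 3), ("R_GB2", 3),
            ("RRC", 2), ("ADDR", addr_bits), ("WR", 1), ("IOC", 1),
            ("OVR", 1), ("BOP", 4), ("R_BB0", 3), ("R_BB1", 3),
            ("R_BB2", 3), ("R_AC", 3)]

def pack(values, n_words):
    """Pack field values into one integer, first field in the high bits
    (the bit order is an assumption, not part of the specification)."""
    word = 0
    for name, width in leaf_fields(n_words):
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word

def unpack(word, n_words):
    """Inverse of pack()."""
    values = {}
    for name, width in reversed(leaf_fields(n_words)):
        values[name] = word & ((1 << width) - 1)
        word >>= width
    return values

# Source-register encoding for R_GB0 / R_GB1, as listed in the text.
K_REG = {"RR": 0, "GR1": 1, "GR2": 2, "GR3": 3, "GR4": 4, "GR5": 5,
         "TR": 6, "FR": 7}

# GOP=3 is a made-up opcode; the operation set is not yet fixed.
instr = {"GOP": 3, "R_GB0": K_REG["GR1"], "R_GB1": K_REG["TR"],
         "R_GB2": 2, "ADDR": 17}
word = pack(instr, n_words=256)
total_bits = sum(width for _, width in leaf_fields(256))   # 34 + lg(256)
```

With 256 words of memory per leaf the estimated instruction width comes to 42 bits under these field sizes.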


R_GB2

R_GB2 encodes a null destination (Λ), one of the k-bit general registers, the tree register, or
the flag register. The encoded designation determines the register that will record the value
on GB2. The value is recorded if and only if the LEAF processor's activity control, as
determined by AC, is set. This field cannot specify the refresh register. Our 5 register
design is encoded as:

000:Λ 001:GR1 010:GR2 011:GR3 100:GR4 101:GR5 110:TR 111:FR


RRC 

RRC is a 2-bit field that specifies what data, if any, is written to the refresh register. The
field encodes one of four possibilities: a no-op (Λ), write from GB2, write from memory, or
write from IOR. Specifying this field independently of the R_GB2 field allows the refresh register
to be written in parallel with any of the destinations specified by R_GB2. It also isolates the
leaf processor from the memory and IO processors, allowing each to run at its own optimal
speed. Activity control is used when RR is being written from GB2; it is ignored in all
other cases. Our current design encodes RRC as follows:

00: Λ 01: RR ← GB2 10: RR ← M[ADDR] 11: RR ← IOR

ADDR 

ADDR is a lg(n) bit field, where n is the number of words of memory, that specifies the 
memory location to be read from or written to. This field is required only when a read or 
write is being performed. 


WR 

WR is a 1-bit field that indicates when data is to be written from the refresh register to
the memory (M[ADDR] ← RR). This operation can be performed in parallel with other
operations that access the refresh register.

IOC

IOC is a 1-bit field that specifies when data is written from the refresh register to the IO
register (IOR ← RR).

OVR 

OVR is a 1-bit field that enables or disables the override capability of the k-bit ALU. When
enabled, the k-bit ALU will either perform the operation specified by field GOP or override
that specification and transfer data from input bus GB0 to output bus GB2. The override
capability is used for computing inclusive scans from exclusive scans as well as providing
MUX-like capabilities to the k-bit ALU (see section 1.6).

BOP

BOP encodes the operation to be performed by the 1-bit ALU. The bits are the entries in
the truth table of the boolean function being computed.
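Because a two-input boolean function has four truth-table entries, a 4-bit BOP field can name any of the 16 possible functions. The sketch below shows one way such an encoding could be evaluated; the assignment of truth-table rows to bit positions is an assumption, since the text does not fix it.

```python
def one_bit_alu(bop, a, b):
    """Evaluate a 2-input boolean function given as a 4-bit truth table.
    Bit (2*a + b) of bop is the output for inputs (a, b); this bit
    ordering is an assumption."""
    return (bop >> ((a << 1) | b)) & 1

def truth_table(fn):
    """Build the 4-bit BOP code for a Python function of two bits."""
    bop = 0
    for a in (0, 1):
        for b in (0, 1):
            bop |= (fn(a, b) & 1) << ((a << 1) | b)
    return bop

BOP_AND = truth_table(lambda a, b: a & b)   # 0b1000
BOP_OR = truth_table(lambda a, b: a | b)    # 0b1110
BOP_XOR = truth_table(lambda a, b: a ^ b)   # 0b0110
```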


R_BB0, R_BB1, and R_AC

R_BB0, R_BB1, and R_AC each encode 8 registers (not the same registers). These encodings are
based on the interrelations between the 1-bit registers and a typical instruction mix. The
specific registers connected to each bus and their encodings will be altered as experience is
gained with algorithms on this architecture. The following are design goals which influence
these choices as well as an initial assignment of registers to busses.


• Extended precision requires that the CARRY and ALLZERO status outputs from
the k-bit processor be fed back into the processor's CARRY and ALLZERO inputs.
Consequently, data paths that allow parallel routing of the CARRY and ALLZERO
registers to the k-bit processor must be supported.

• Inclusive scans are completed in the LEAF processors using the results of the exclusive
scan produced by the tree and two additional steps² performed by the LEAF processor.
These steps make use of the 1-bit ALU as well as the override feature of the k-bit
ALU and require that 1-bit registers VLD↑ and SEG be presented in parallel to the
1-bit ALU. Details of these operations are described in section 1.6.

• The field R_BB0 encodes one of BR2, BR3, BR4, BR5, VLD↑, SIGN, ALLZERO, and VLD↓.

• The field R_BB1 encodes one of BR1, BR3, BR4, BR5, SEG, LOB, CARRY, and OVERFLOW.

• The field R_AC encodes one of BB2 (i.e., the output bus of the 1-bit ALU), BR1, BR2,
BR3, BR4, BR5, ALLZERO, and ONE.

R_BB2 encodes one of null (Λ), BR1, BR2, BR3, BR4, BR5, SEG, VLD↑. This field determines
the register that will record the value on BB2. This value is recorded if and only if the
processor's activity control, as determined by AC, is set. Our initial design is encoded as:

000:Λ 001:BR1 010:BR2 011:BR3 100:BR4 101:BR5 110:SEG 111:VLD↑

1.5 Interaction between the LEAF and memory processors 

The read and write commands form the conceptual interface between the LEAF processor
and the memory processor. They are executed by the memory processor and cause data to
be transferred between RR and the memory. These commands are encoded in two fields,
RRC and WR. The read command is encoded in the RRC field as a command to move
data from memory to the refresh register. This command causes the memory processor to
transfer the data at the location specified by ADDR to the refresh register. If the read is
destructive, as is the case with a DRAM implementation, the main control unit will issue a
write command to rewrite the contents of RR back to memory.

² Plus scans only require one additional step.

The write command specified by WR indicates that data is to be written from RR to
memory and must not conflict with RRC: the WR bit cannot specify write-to-memory
when RRC specifies read-from-memory. These commands are not affected by the processor's
activity control.

Once initiated by either a read or a write, the memory processor performs the data transfer
asynchronously. The LEAF processor should not change RR while the memory processor is
busy and is thus limited to using RR only when the memory processor is idle.
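The rules for RRC and WR can be summarized in a small behavioral model. This is only a sketch: it ignores the asynchronous timing of the memory processor, and it assumes that when WR is combined with another RR operation the memory receives the value RR held at the start of the instruction. The constant and function names are illustrative.

```python
RRC_NOOP, RRC_FROM_GB2, RRC_FROM_MEM, RRC_FROM_IOR = 0b00, 0b01, 0b10, 0b11

def check_memory_fields(rrc, wr):
    """Reject the one combination the text forbids: a memory write (WR=1)
    in the same instruction as a read from memory (RRC=10)."""
    if wr and rrc == RRC_FROM_MEM:
        raise ValueError("WR conflicts with RRC read-from-memory")
    return True

def memory_step(memory, rr, rrc, wr, addr, gb2=0, ior=0, active=True):
    """Effect of one instruction on RR and memory; returns the new RR."""
    check_memory_fields(rrc, wr)
    if wr:
        memory[addr] = rr            # not affected by activity control
    if rrc == RRC_FROM_GB2:
        if active:                   # the only activity-controlled case
            rr = gb2
    elif rrc == RRC_FROM_MEM:
        rr = memory[addr]
    elif rrc == RRC_FROM_IOR:
        rr = ior
    return rr
```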


1.6 Interaction between TREE and LEAF processors 

Most of the interactions between the TREE and LEAF processor are of the standard variety; 
however, there are two types of interactions that require special attention. One is the 
handling of overflow between these two processors. The other is the communication of 
scan results between the two processors. 

These interactions are handled through the interface provided by TR, VLD↑, SEG, and
VLD↓. TR acts as a bidirectional data port connecting the two processors. When the tree
accepts data from the leaves it reads VLD↑ and SEG. VLD↑ indicates whether the data in TR
should participate in the tree operation, while SEG indicates whether the leaf is
the first in a segment. These two 1-bit registers can be read and written by the 1-bit
ALU; however, they may only be read by the tree.

When the tree provides data to the leaves it uses VLD↓ to indicate whether the data in TR
should be used by the leaf. VLD↓ can only be read by the LEAF processor
and can only be written by the TREE processor. There is dedicated hardware in the tree that
computes VLD↓ as a function of SEG, VLD↑, and the direction of the scan (left-to-right
or right-to-left). Changing any of these three inputs will cause VLD↓ to change.


1.6.1 Overflow Interactions 

Since the TREE and LEAF processors jointly participate in numerical computations, the
LEAF processor must be responsive to overflow information generated by the TREE
processor. This information is stored in a single bit in TR and must be used by the LEAF
processor's k-bit ALU to determine its overflow condition. The overflow condition of a LEAF
cell operation (even data movement such as GRi ← TR) that involves the tree register as a
source must be dependent on the overflow condition generated in the tree. If TR indicates
an overflow from the tree, the LEAF processor must complete the specified operation and
set the overflow status to true. The overflow status must also be set if the LEAF processor
performs an arithmetic operation that itself generates an overflow.

In the case when the tree is not supplying data to the leaf (i.e., VLD↓ is 0), the overflow
field in TR is ignored.
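The overflow rule reduces to a short boolean expression. The function below is a sketch of that rule as described in this section; the argument names are illustrative, and the position of the overflow bit within TR is deliberately left out because the text does not specify it.

```python
def leaf_overflow(uses_tr, vld_down, tree_overflow, local_overflow):
    """OVERFLOW status after a LEAF operation.

    uses_tr        -- the operation reads TR as a source
    vld_down       -- VLD-down: the tree is supplying data to this leaf
    tree_overflow  -- the overflow bit carried in TR
    local_overflow -- the k-bit ALU itself overflowed
    """
    from_tree = bool(uses_tr and vld_down and tree_overflow)
    return from_tree or bool(local_overflow)
```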


1.6.2 Inclusive and Exclusive Scans 

When an inclusive scan is desired, the LEAF processor must compute it from the TREE
cells' exclusive scan and its own internal data.

This computation is performed using the override capability of the k-bit ALU, which
permits the ALU to either execute the operation specified or to ignore the operation and
replace it with a data transfer. The data transfer routes the data on input bus GB0 directly
to output bus GB2. When an override is in effect the operation specified by GOP and the
data on input bus GB1 are ignored by the k-bit ALU.

Figure 4 shows an operation and its overridden counterpart. Notice that the data transfer
that replaces the specified operation ignores the fields R_GB1 and GOP but uses the fields
R_GB0 and R_GB2 without alteration.

                         Operation Mnemonic     Operation field specification

Operation                GRi ← (GRj op GRk)     R_GB0 = j, R_GB1 = k, R_GB2 = i, GOP = op

Overridden Counterpart   GRi ← GRj              R_GB0 = j, R_GB2 = i

Figure 4: Comparison of normal operation and its overridden counterpart

In order to produce inclusive scans from exclusive scans two cases must be considered:
one, when the tree processor provides data on which the inclusive scan depends (VLD↓=1
∧ SEG=0), and two, when it does not (VLD↓=0 ∨ SEG=1). Notice that these conditions
are functions not only of the results produced in the tree but also of the segmentation bit in
the leaf processor. This latter dependence is due to the fact that the first leaf processor in
a segment (SEG=1) must ignore the data it receives from the tree since this data is from a
different segment. The LEAF processor must be able to distinguish between these cases and
complete the inclusive scan appropriately. Figure 5 shows the operations required of the
LEAF processor for these two cases.


Tree data should be used       GRi ← (GRj op TR)

Tree data should be ignored    GRi ← GRj


Figure 5: Leaf operations for forming inclusive scans from exclusive scans

Since the second of these operations is an overridden version of the first, the choice
between the two operations can be performed by the override capability of the k-bit ALU
by using the single instruction shown in Figure 6.

OVR   k-bit operation       1-bit operation

on    GRi ← (GRj op TR)     Λ ← (¬VLD↓ ∨ SEG)

Figure 6: Using OVR to form inclusive scans from exclusive scans
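The conversion from an exclusive to an inclusive scan can be sketched in software. The model below is an illustration under several assumptions: the tree is replaced by a sequential reference function, all leaves participate (VLD↑ = 1), the scan runs left to right, and a 1-bit ALU result of 1 is taken to select the override, which is an inference from the surrounding text. All names are hypothetical.

```python
import operator

def segmented_exclusive_scan(values, seg, op):
    """Reference model of what the tree delivers for a left-to-right scan:
    for each leaf, a pair (VLD-down, TR). The first leaf of a segment
    gets no usable data."""
    out, acc, have = [], None, False
    for x, s in zip(values, seg):
        if s:                        # a new segment starts at this leaf
            acc, have = None, False
        out.append((1 if have else 0, acc))
        acc = op(acc, x) if have else x
        have = True
    return out

def k_bit_alu(op, gb0, gb1, ovr, one_bit_result):
    """k-bit ALU with override: when OVR is on and the 1-bit ALU result
    is 1, pass GB0 through and ignore both GOP and GB1."""
    if ovr and one_bit_result:
        return gb0
    return op(gb0, gb1)

def inclusive_scan(values, seg, op):
    """One leaf instruction per element, as in Figure 6."""
    tree = segmented_exclusive_scan(values, seg, op)
    result = []
    for x, s, (vld_down, tr) in zip(values, seg, tree):
        ignore_tree = int((not vld_down) or s)       # 1-bit operation
        result.append(k_bit_alu(op, x, tr, True, ignore_tree))
    return result

sums = inclusive_scan([1, 2, 3, 4, 5], [1, 0, 0, 1, 0], operator.add)
print(sums)
```

Here the second segment begins at the fourth element, so its running sum restarts there.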

1.6.3 Using the k-bit ALU as a Multiplexor for MIN and MAX scans 

The OVR field can also be used to cause the k-bit ALU to function as a MUX, routing one
of its inputs, GB0 or GB1, to GB2. This is accomplished by using OVR and the 1-bit ALU
in conjunction with the k-bit operation sel-arg2 as shown below. Notice that sel-arg2
routes the second argument to the output and that its overridden counterpart routes the
first argument.


GRi ← sel-arg2(GRj, GRk)

where sel-arg2 performs GRi ← GRk


The ability to use the k-bit ALU as a MUX also provides support for forming inclusive
scans from their exclusive counterparts. It is especially useful for MIN and MAX scans
because these operations are performed differently in the TREE and LEAF cells.³ In a
TREE cell MIN and MAX are functions that output either the MIN or MAX of their inputs,
but in a LEAF cell MIN and MAX are simulated by comparing two data values and selecting one
based on the result of the comparison. This difference requires that inclusive MIN and MAX
scans use the two steps shown in Figure 7 to convert an exclusive scan to an inclusive scan.


GRi ← MIN(GRj, TR)

OVR   k-bit operation              1-bit operation

off   Λ ← (GRj - TR)               BR1 ← (¬VLD↓ ∨ SEG)

on    GRi ← sel-arg2(GRj, TR)      Λ ← (BR1 ∨ SIGN)

Figure 7: LEAF cell steps to complete MIN/MAX inclusive scans


While the comparison (GRj - TR) is taking place in the k-bit processor, the 1-bit
processor decides whether TR should be used to complete the scan; this result is stored in BR1.
In the second step the k-bit processor is used as a multiplexor selecting either GRj or TR and
routing it to GRi. Since OVR is on, the specific selection is determined by the 1-bit ALU
computation. GRj is selected when either the tree data is not to be used or the contents of
GRj is less than TR.
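The two-step MIN completion can be traced with a small model. This sketch assumes an 8-bit word for the example, operands small enough that the subtraction does not overflow, and a SIGN status equal to the top bit of the two's complement difference; the function names are hypothetical.

```python
K = 8   # word width, chosen only for this example

def sign_of_difference(a, b, k=K):
    """SIGN status after the k-bit subtraction a - b (two's complement),
    assuming the subtraction does not overflow."""
    return ((a - b) >> (k - 1)) & 1

def leaf_min_step(gr_j, tr, vld_down, seg):
    """Complete an inclusive MIN scan in one leaf, following Figure 7."""
    # Step 1 (OVR off): compare in the k-bit ALU, discard the difference,
    # and record in BR1 whether the tree data must be ignored.
    sign = sign_of_difference(gr_j, tr if tr is not None else 0)
    br1 = int((not vld_down) or seg)
    # Step 2 (OVR on): sel-arg2 would route TR; the override routes GRj.
    override = br1 | sign
    return gr_j if override else tr
```

For a MAX scan the same structure applies with the sense of the comparison reversed.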

³ Addition is performed identically in the LEAF and TREE cells.






1.6.4 Inserting identity elements 

TREE cells perform exclusive scans (i.e. the initial element in a segment is the identity 
element for the operation) but do not insert the identity element into the result. In order 
for the result to contain the identity element the LEAF processor must insert it. 

This is accomplished using the MUX-like capabilities of the LEAF processor to choose 
between the data provided by the tree and the identity element for the operation. Figure 8 
shows the constants that are expected to be of special interest. These five constants can be 
constructed from a 3-bit field in which one bit specifies the high order bit, one the low order 
bit, and one the internal bits. 



Constant (high bit, internal bits, low bit)    Use

0   00..00   0                                 identity for +, OR, MAX on positive integers
1   11..11   1                                 identity for AND, MIN on positive integers; decrement
1   00..00   0                                 identity for MAX on 2's complement
0   11..11   1                                 identity for MIN on 2's complement
0   00..00   1                                 increment

Figure 8: Important Constants
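The construction of these constants from a 3-bit field can be sketched as follows. The function and parameter names are illustrative, and the 8-bit width used in the examples is chosen only to keep the values short.

```python
def make_constant(high, internal, low, k=32):
    """Build a k-bit constant from three control bits: the high-order bit,
    the value replicated across the k-2 internal bits, and the low-order bit."""
    if k < 2:
        raise ValueError("word width must be at least 2")
    middle = ((1 << (k - 2)) - 1) << 1 if internal else 0
    return (high << (k - 1)) | middle | low

def as_signed(value, k=32):
    """Interpret a k-bit pattern as a two's complement integer."""
    return value - (1 << k) if value >> (k - 1) else value

zero = make_constant(0, 0, 0, k=8)         # 0x00
all_ones = make_constant(1, 1, 1, k=8)     # 0xFF, i.e. -1 when signed
most_neg = make_constant(1, 0, 0, k=8)     # 0x80
most_pos = make_constant(0, 1, 1, k=8)     # 0x7F
one = make_constant(0, 0, 1, k=8)          # 0x01
```

Read as two's complement, the all-ones pattern is -1, which is why it doubles as the decrement constant.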


1.7 Activity control 

Activity control is used on all k-bit and 1-bit registers. These registers latch their input
values at the end of each execute cycle if and only if they are selected by R_GB2 or R_BB2
and AC is set. There is no activity-controlled write-to-memory; however, the effect of this
operation can be obtained by using the standard memory operations in conjunction with an
activity-controlled operation on RR. Figure 9 shows how this is accomplished.


M[ADDR] ← GRi in all active leaf nodes:

    RR ← M[ADDR]
    RR ← GRi          (mediated by AC)
    M[ADDR] ← RR

Figure 9: Activity Controlled write to memory
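The three-step sequence can be traced on a toy model of two leaves. The dictionary layout used for a leaf is purely an assumption for the example; the point is that inactive leaves write back the value they just read, so their memory is unchanged.

```python
def activity_controlled_write(leaves, addr, src):
    """Apply the three steps of Figure 9 to every leaf. Each leaf is a
    dict with 'mem' (list), 'regs' (dict), 'RR', and 'AC' (0 or 1)."""
    for leaf in leaves:                      # RR <- M[ADDR]   (all leaves)
        leaf["RR"] = leaf["mem"][addr]
    for leaf in leaves:                      # RR <- GRi       (only if AC)
        if leaf["AC"]:
            leaf["RR"] = leaf["regs"][src]
    for leaf in leaves:                      # M[ADDR] <- RR   (all leaves)
        leaf["mem"][addr] = leaf["RR"]

leaves = [
    {"mem": [10, 11], "regs": {"GR1": 99}, "RR": 0, "AC": 1},
    {"mem": [20, 21], "regs": {"GR1": 77}, "RR": 0, "AC": 0},
]
activity_controlled_write(leaves, addr=1, src="GR1")
```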

Chapter 1
Introduction 


The design of the CAM chip has been done in a UNIX software environment using a design 
tool that supports the definition of digital electronic modules, the composition of these mod- 
ules into higher level circuits, and event-driven simulation of these circuits. Our tool provides 
an interface whose goals include straightforward but flexible primitive module definition and 
circuit composition, efficient simulation, and a debugging environment that facilitates design 
verification and alteration. 

The tool provides a set of primitive modules which can be composed into higher level
circuits. Each module is a C-language subroutine that uses a set of interface protocols
understood by the design tool. Primitives can be altered simply by recoding their C-code
image; in addition, new primitives can be added, allowing higher level circuits to be described
in C-code rather than as a composition of primitive modules. This feature can greatly enhance
the speed of simulation.¹

Effective composition of primitive modules into higher level circuits is essential to our 
design task. Not only are the standard features of a description language required but in 
addition, features such as recursive descriptions of circuit composition, parameterized module 
descriptions, and strongly-typed port types are essential to efficient circuit design. These 
features are supported by our design tool’s composition language which allows the user to 
specify a hardware description in a C-like syntax. Parameterized modules, recursive and 
iterative descriptions, macro-like capability to describe collections of wires (i.e., cables), and 
decision making support that allows context sensitive module expansion are provided by 
our tool. In addition, our tool can determine the cost of a circuit based on the costs of its 
primitive modules. This feature is not exact but does provide a good approximation of the 
complexity of the designed circuit. 
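The combination of parameterized, recursive descriptions and cost estimation can be illustrated outside the tool. The sketch below is not written in the tool's composition language; it models a recursively defined collection tree in Python, with made-up unit costs for two hypothetical primitives.

```python
PRIMITIVE_COST = {"tree_cell": 5, "leaf_cell": 20}   # made-up unit costs

def collection_tree(n_leaves):
    """Recursively expand a tree over n_leaves leaves (a power of two)
    into a nested (primitive_name, children) structure."""
    if n_leaves == 1:                 # base case ends the recursion
        return ("leaf_cell", [])
    half = collection_tree(n_leaves // 2)
    return ("tree_cell", [half, half])

def cost(module):
    """Sum primitive costs over the expanded hierarchy."""
    name, children = module
    return PRIMITIVE_COST[name] + sum(cost(child) for child in children)

eight_leaf_cost = cost(collection_tree(8))   # 8 leaves and 7 tree cells
```

The single parameter `n_leaves` plays the role of a module parameter, and the explicit base case is what guarantees that the expansion terminates.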

Simulation is performed by an event-driven simulator that handles gates as well as
tri-state bi-directional busses and provides the user not only with a view of what a circuit is
computing but also control over the circuit so that design flaws can be effectively isolated
and corrected. The simulator is controlled with a command language which allows the user
to see a wire or set of wires, as well as change the values on wires. Operations can be done
immediately (i.e., at the time the user enters them) or scheduled to take place at a specified
time. Simulations can be run for a specific period of time or until a certain condition is
detected in the hardware. They can be controlled from the keyboard or indirectly from a
file.

¹ Converting a higher-level circuit into a primitive module is straightforward when the timing of the
primitive module need not be identical to the higher level circuit. Higher level circuits can be converted to
primitive modules with identical time performance; however, the conversion process is much more complex.
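For readers unfamiliar with event-driven simulation, the following minimal sketch shows the general technique: wire changes are queued as timed events, and a gate is re-evaluated only when one of its inputs changes. It is an illustration of the idea only and does not reproduce the tool's implementation, its command language, or its handling of tri-state busses; all class and method names are hypothetical.

```python
import heapq

class Simulator:
    """Minimal event-driven simulator: wires hold values, and gates are
    re-evaluated only when one of their input wires changes."""

    def __init__(self):
        self.time = 0
        self.wires = {}
        self.fanout = {}      # wire -> gates that read it
        self.events = []      # heap of (time, seq, wire, value)
        self.seq = 0          # tie-breaker keeps events in FIFO order

    def add_gate(self, fn, inputs, output, delay=1):
        gate = (fn, inputs, output, delay)
        for wire in inputs:
            self.fanout.setdefault(wire, []).append(gate)

    def schedule(self, wire, value, at=None):
        """Set a wire now (at=None) or at a specified future time."""
        when = self.time if at is None else at
        heapq.heappush(self.events, (when, self.seq, wire, value))
        self.seq += 1

    def run(self, until=None, stop=None):
        """Run for a period of time or until stop(wires) becomes true."""
        while self.events and (until is None or self.events[0][0] <= until):
            self.time, _, wire, value = heapq.heappop(self.events)
            if self.wires.get(wire) == value:
                continue                  # no change, so no new events
            self.wires[wire] = value
            for fn, inputs, output, delay in self.fanout.get(wire, []):
                args = [self.wires.get(w, 0) for w in inputs]
                self.schedule(output, fn(*args), at=self.time + delay)
            if stop is not None and stop(self.wires):
                break

sim = Simulator()
sim.add_gate(lambda a, b: 1 - (a & b), ["a", "b"], "y")   # NAND, delay 1
sim.schedule("a", 1)            # immediate operation
sim.schedule("b", 1, at=5)      # operation scheduled for time 5
sim.run(until=3)
y_at_3 = sim.wires["y"]         # a=1, b still 0
sim.run()
y_final = sim.wires["y"]        # a=1, b=1
```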

The design tool consists of two main parts: the command and definition languages. The 
definition language is used to read circuit definitions. The command language controls the 
actions of the simulator. These topics are detailed in sections 2.5 and 2.6. 

Following is a brief introduction on the definition and creation of circuit models. It 
defines many terms used later. 


1.1 Circuit Definition 

The circuit definition language describes connections between primitive objects. These
objects, called primitive modules, have functionality predefined in the design tool. Primitive
modules have a special set of entry points which are connected when forming the circuit
model. These entry points are called ports and the connections between them are referred to
as signals or wires. There is a causality between connected modules: execution of a module
may affect modules connected to it.

The design tool reads descriptions using a definition language. The language consists of 
two types of object definitions: module and cable. Cable definitions group related signals 
together. Module definitions specify primitive modules and their connections. A module may 
define other modules as children of itself, and specify connections between its child modules. 
In this case the module is referred to as a composite module. 

The circuit is built from a set of hierarchical module and cable definitions. Flattening 
the hierarchy produces the basic model of a set of primitive modules connected by wires. 

In order to name objects in the hierarchical design, hierarchical names are used by the 
design tool. These names specify objects which cannot be directly referenced within the 
current context. This is done by supplying a list of names specifying a path to the object. 
Fields in the composite name are separated by the dot character '.'.
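Flattening and hierarchical naming can be illustrated with a nested dictionary standing in for a module hierarchy. This is a sketch of the concept, not of the tool's internal data structures; the design and its module names are invented for the example.

```python
def flatten(module, prefix=""):
    """Yield (hierarchical_name, primitive_type) for every primitive in a
    nested design. A composite is a dict of child name -> child; a
    primitive is a string naming its type."""
    for name, child in module.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(child, dict):
            yield from flatten(child, path)
        else:
            yield path, child

def resolve(module, hierarchical_name):
    """Follow a dot-separated path from the current context to an object."""
    obj = module
    for field in hierarchical_name.split("."):
        obj = obj[field]
    return obj

design = {"cam": {"tree": {"root": "tree_cell"},
                  "leaf0": {"alu": "k_bit_alu", "act": "one_bit_alu"}}}
flat = dict(flatten(design))
```

A name such as `cam.leaf0.alu` reaches an object that is not directly visible from the top-level context by listing each module on the path to it.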


1.2 Model Creation 

The creation of a circuit model is performed in phases. 

When a cable or module definition is read, its syntax is checked and the definition is stored 
as a master definition. These definitions may have input arguments which need assignment. 

When a module is created, input arguments to master definitions are assigned, resulting in
a new definition type. These definitions, which have specific input arguments, are referred to
as definition instances. Each definition instance is fully examined, checking the consistency
of the connections made and the referenced modules.

The circuit model is made from these definition instances. The model is designed for 
speed in simulating the functionality of the circuit and contains all structures necessary for 
simulation. Such a model is called a generated module or simulation instance.


Chapter 2

The Definition Language 


The definition language is used to describe digital electronic circuits by building hierarchical 
structures connecting primitive modules. A definition file consists of a sequence of module 
and cable definitions. Modules come in two types: primitive and composite. 

Primitive modules are the basic building blocks of the definition language. These objects 
perform operations defined by C functions which have been precompiled into the design tool. 
A list of primitive modules is given in appendix B. 

A composite module definition defines submodules of itself and connections to be made 
among their ports, as well as its own ports. Attaching submodule ports causes interactions 
between the operations of the respective modules. 

Cable definitions allow signals to be identified in groups, which simplifies connection of 
ports. 

Module and cable definitions are similar in structure and are analogous to functions in 
a conventional programming language. They may have formal input arguments and may 
use other definitions (as well as their own) recursively. Termination of such recursion is not 
assured. 


2.1 Syntax Conventions 

The syntax notation used in this manual is a variant of Backus-Naur form.
The following conventions apply:

1. Boldface type denotes reserved words. 

2. Lowercase words, which may have embedded underscores, denote syntactic constructs. 

3. Character tokens are shown using typewriter type. Most punctuation characters are 
used as character tokens, with exceptions stated below. Note that the exceptions are 
printed in Roman type.

4. The vertical bar ‘|’ separates alternate syntax items when it is used at the beginning
of a line.

5. Square brackets (‘[’, ‘]’) enclose optional items.

6. The dollar sign ‘$’ in a syntax rule denotes the remainder of the line as a comment.
‘$’ is not used in the syntax.

2.2 Variables and Assignment 

A variable is a name associated with an integer value in a module or cable definition. Vari- 
ables have no meaning outside the current definition. A variable name may be any valid 
string token (quoted or unquoted; see appendix A). There are no arrays of variables. A 
variable may have the same name as signals, modules, or cables since its context is distinct. 
There are three variable types: input, loop, and assignment. Within a specific definition, a 
variable may be used as only one type. 


Input Variables 

Input variables are arguments to a module or cable definition. They are determined at 
invocation and may not be reassigned within the current object. These variables are valid 
throughout the current object. Each input variable of a definition must be given a value 
upon use. 


Loop Variables 

Loop variables are used in for loops in the component section of modules. Each for loop 
controls the assignment of a single loop variable. Loop variables are only valid within the 
controlling loop, and may not be reassigned within the loop. 


Assignment Variables 

Assignment variables are used in the component section of modules. They are set using the 
assign statement (string_token <- arith_expr ;). This assigns the current value of the
expression to the variable. Once a variable has been assigned to, it is valid until the end 
of the module. Each subsequent use of the variable gets the assignment value unless the 
variable has been reassigned. Assignment variables may not be reused as loop variables. 

Control flow variations resulting from if statements or loops may allow an assignment 
variable to be referenced prior to assignment. 

Cables have only input variables, since they have no component section.


2.3 Expressions


Expressions are used in various ways to control the assembly of modules. There are two types 
of expressions: arithmetic and logical. Arithmetic expressions result in integer values. Logical 
expressions return one of the values TRUE or FALSE. Arithmetic and logical expressions 
are not interchangeable. 


Arithmetic Expressions 


Arithmetic expressions compute integer values. They may be string tokens (variables), 
numeric tokens (constants), or may be created by application of an arithmetic operator 
to one or more arithmetic expressions. 


arith_expr :=
      string_token                $ variable value
    | numeric_token               $ constant value
    | - arith_expr                $ arithmetic negation
    | ( arith_expr )              $ arithmetic grouping
    | arith_expr * arith_expr     $ multiplication
    | arith_expr / arith_expr     $ division
    | arith_expr % arith_expr     $ modulus
    | arith_expr + arith_expr     $ addition
    | arith_expr - arith_expr     $ subtraction


Division operations return a truncated result (using C convention), since integer division 
is not exact. 
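
As a minimal sketch of this rule, the Python functions below mimic integer division that truncates toward zero. The names c_div and c_mod are hypothetical. Reading "C convention" as truncation toward zero is an assumption, as is the sign of the modulus result (chosen here so that a = b*q + r holds); the manual states only that division truncates.

```python
def c_div(a, b):
    """Integer division truncating toward zero (assumed meaning of 'C convention')."""
    q = abs(a) // abs(b)
    # The quotient is negative exactly when the operands have different signs.
    return q if (a < 0) == (b < 0) else -q


def c_mod(a, b):
    """Remainder consistent with c_div: a == b * c_div(a, b) + c_mod(a, b)."""
    return a - b * c_div(a, b)
```

Under these assumptions, c_div(-7, 2) gives -3 rather than the -4 that floor division would produce.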


There are three levels of arithmetic operator precedence. Unary operators (negation and
grouping) share the highest precedence. Multiplication, division, and modulus (*, /, %) have
equal precedence, below that of the unary operators. Addition and subtraction (+, -) share
the lowest precedence.

All binary arithmetic operators associate left-to-right.
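
The precedence and associativity rules can be illustrated with a small recursive-descent evaluator. This is only a sketch in Python, not the design tool's parser: the function names are hypothetical, only decimal constants and simple identifiers are recognized, and division is assumed to truncate toward zero.

```python
import re

# One token: a decimal constant, an identifier, or a single operator character.
TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[-+*/%()])")


def tokenize(text):
    text = text.rstrip()
    pos, out = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        out.append(match.group(1))
        pos = match.end()
    return out


def evaluate(text, variables=None):
    tokens = tokenize(text)
    env = variables or {}
    i = 0

    def peek():
        return tokens[i] if i < len(tokens) else None

    def take():
        nonlocal i
        tok = tokens[i]
        i += 1
        return tok

    def unary():  # highest precedence: negation and grouping
        tok = take()
        if tok == "-":
            return -unary()
        if tok == "(":
            value = additive()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        return int(tok) if tok.isdigit() else env[tok]

    def multiplicative():  # middle precedence: * / %, left to right
        value = unary()
        while peek() in ("*", "/", "%"):
            op, rhs = take(), unary()
            if op == "*":
                value *= rhs
                continue
            quotient = abs(value) // abs(rhs)
            if (value < 0) != (rhs < 0):
                quotient = -quotient  # truncate toward zero (assumed)
            value = quotient if op == "/" else value - rhs * quotient
        return value

    def additive():  # lowest precedence: + -, left to right
        value = multiplicative()
        while peek() in ("+", "-"):
            op, rhs = take(), multiplicative()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = additive()
    if i != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

Each loop folds its operands from the left, so "8 / 4 / 2" is evaluated as (8 / 4) / 2 and "10 - 4 - 3" as (10 - 4) - 3.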


Logical Expressions 

Logical expressions compute the value TRUE or FALSE. They are constructed by the use 
of relational or logical operators. Relational operators produce a logical expression based on 
the validity of a relational query between two arithmetic expressions. Logical operators use 
one or two logical expressions to produce a single logical expression. There are no logical 
variables or constants.

log_expr :=
      arith_expr > arith_expr     $ greater than
    | arith_expr >= arith_expr    $ greater than or equal to
    | arith_expr < arith_expr     $ less than
    | arith_expr <= arith_expr    $ less than or equal to
    | arith_expr = arith_expr     $ equal to
    | arith_expr ! arith_expr     $ not equal to
    | ~ log_expr                  $ logical negation
    | { log_expr }                $ logical grouping
    | log_expr & log_expr         $ logical AND
    | log_expr | log_expr         $ logical OR


Note that the vertical bar in the logical OR represents the character ‘|’.


Because the operands of a relational operator are arithmetic expressions, relational
operators cannot be chained, so questions of precedence or associativity do not arise for them.

There are three levels of logical operator precedence. Unary logical operators (negation 
and grouping) have the highest precedence, followed by logical AND. Logical OR has the 
lowest precedence of logical operators. 

Logical grouping syntax is distinct from that of arithmetic grouping. This reinforces the 
idea of noncompatibility between expression types. 


2.4 Naming Conventions 


Each child object (signal, cable, or submodule) in a definition must be given a unique name. 
This allows unambiguous signal naming within simulation instances (for design verification). 
Names of child objects must be string tokens (quoted or unquoted). 

An object name may have a single associated array index. This index is specified by 
an arithmetic expression enclosed in square brackets following the name. The string token, 
excluding the array index, is called the root name of the object. 

object_name :=
      string_token
    | string_token [ arith_expr ]

Note that the square brackets do not represent optional arguments. 

Example: 

The root name of an object “a[5]” is simply “a”. 

It is often useful to name lists of objects. In this case, a modified array notation, called
an object list, may be used to specify a range of array indices. This is done by supplying
a start and end index for the array, separated by a colon ‘:’. The notation is equivalent to
supplying each object name in order, beginning with the start index, and iterating until the
end index is reached. If the start index is less than the end index, iteration increments by
one; otherwise it decrements by one.

object_list :=
    string_token [ arith_expr : arith_expr ]

Note that the square brackets do not represent optional arguments. 


Example: 

“a[1:2]” expands to “a[1]” “a[2]”.

“a[2:1]” expands to “a[2]” “a[1]”.
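
The expansion rule can be sketched in a few lines of Python. The function name expand_object_list is hypothetical, and the indices are taken as already-evaluated integers rather than arithmetic expressions.

```python
def expand_object_list(root, start, end):
    """Expand root[start:end] into individual object names, in iteration order."""
    # Count up when start < end; otherwise count down (a single name if equal).
    step = 1 if start < end else -1
    return [f"{root}[{index}]" for index in range(start, end + step, step)]
```

Both bounds are inclusive, so expand_object_list("a", 2, 1) yields the names in descending order.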

A list of objects may contain single object names and object lists. 

object_name_list :=
      object_name
    | object_list
    | object_name object_name_list
    | object_list object_name_list

In a module definition, each internal subcomponent or signal must have a distinct name 
(root name and index). Also, objects with the same root name must have similar types. 
This means that signals, components and cables may not share root names. Furthermore, 
components or cables which share a root name must share the same master definition. These 
checks are performed during the creation of module definition instances. 

It is often necessary to name an object which cannot be directly referenced from the 
current level. In this case a composite name is used, using the dot character ‘.’ to separate
levels. This is referred to as a hierarchical name. 

hierarchical_name :=
      object_name
    | object_name . hierarchical_name

Example:

The hierarchical name “a.b” refers to an object “b” which is a child of object 
“a” , where “a” is a child of the current module. 

Hierarchical naming may be used with array expansion, in which case rightmost indices 
are expanded first. 

hierarchical_list :=
      object_name
    | object_list
    | object_name . hierarchical_list
    | object_list . hierarchical_list

Example: 

“a[1:2].b[3:4]” expands to “a[1].b[3]” “a[1].b[4]” “a[2].b[3]” “a[2].b[4]”.

The hierarchical analog to an object name list, called a hierarchical name list, may now be
defined. Note that every hierarchical name is also a hierarchical list.

hierarchical_name_list :=
      hierarchical_list
    | hierarchical_list hierarchical_name_list
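
Expansion of a single hierarchical list, with rightmost indices expanded first, amounts to a Cartesian product in which the rightmost field varies fastest. The Python sketch below illustrates this under simplifying assumptions: the names expand_field and expand_hierarchical_list are hypothetical, indices are integer constants rather than arithmetic expressions, and quoted string tokens containing dots or brackets are not handled.

```python
import itertools
import re

# root, optional [start] or [start:end] with integer constants only
FIELD = re.compile(r"^([^\[\].]+)(?:\[(-?\d+)(?::(-?\d+))?\])?$")


def expand_field(field):
    """Expand one field such as 'a', 'a[5]' or 'a[1:2]' into a list of names."""
    match = FIELD.match(field)
    if match is None:
        raise ValueError(f"cannot parse field {field!r}")
    root, start, end = match.groups()
    if start is None:
        return [root]
    start = int(start)
    end = start if end is None else int(end)
    step = 1 if start < end else -1
    return [f"{root}[{i}]" for i in range(start, end + step, step)]


def expand_hierarchical_list(name):
    """Expand every field; itertools.product varies the rightmost field fastest."""
    fields = [expand_field(field) for field in name.split(".")]
    return [".".join(parts) for parts in itertools.product(*fields)]
```

Calling expand_hierarchical_list("a[1:2].b[3:4]") reproduces the order given in the example: a[1].b[3], a[1].b[4], a[2].b[3], a[2].b[4].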


2.5 Cable Definitions 

A cable represents an ordered list of signals, each signal having an associated type. Signal 
typing is used to ensure that the use of a module is consistent with its definition. 

cable_definition :=
    cable string_token [ ( variable_list ) ] typed_signal_list end

variable_list :=
      string_token
    | string_token , variable_list

typed_signal_list :=
      signal_name_list signal_type
    | signal_name_list signal_type typed_signal_list

signal_name_list :=
      object_name_list
    | cable_use
    | object_name_list signal_name_list
    | cable_use signal_name_list

signal_type :=
      input
    | output
    | inout

The string token following cable is the cable name. This name is used for future
references to the cable. The variable list is a list of input variables for the cable. When the
cable is used, each input variable must be given a value. The typed signal list is a list of the
wires which comprise the cable. It may include cable uses, which are defined below. Each
signal is given one of three allowable types: input, output, or inout. The meaning of the
types will be described in section 2.6.


Example: 

cable c1
    s1 s2 input
    s3 output
    s4 inout
end

In the example, signals “s1” and “s2” are both input.

After a cable has been defined, it may be used anywhere that a signal may be used. This
includes being used in other cable definitions. The following syntax defines a name as a use
of a cable:

cable_use :=
      cable string_token [ ( argument_list ) ] object_name
    | cable string_token [ ( argument_list ) ] { object_name_list }

argument_list :=
      arith_expr
    | arith_expr , argument_list

The string token following cable is the name of a cable definition. The argument list 
given must be exactly the same size as the number of input variables to the cable definition. 
Following the arguments is the list of new cable instance names. 

Example: 

cable c1 ci1
cable c1 { ci2 ci3 }

The first use defines a single instance “ci1” of cable “c1”. The second use defines
two additional instances, “ci2” and “ci3”, of cable “c1”.

When a cable is used in another cable definition, the type of the resultant signal depends 
on both the signal type given in the previous definition, and the type given to the cable use. 

The following matrix shows the retyping rules:

                      subsignal type
    cable type    input     output    inout

    input         input     output    inout
    output        output    input     inout
    inout         inout     inout     inout
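
The matrix can be written down directly as a lookup table. The Python sketch below is illustrative only; the names RETYPE and retype are hypothetical.

```python
# (type given to the cable use, type of the subsignal in the cable definition)
#   -> resulting signal type, copied from the retyping matrix
RETYPE = {
    ("input", "input"): "input",
    ("input", "output"): "output",
    ("input", "inout"): "inout",
    ("output", "input"): "output",
    ("output", "output"): "input",
    ("output", "inout"): "inout",
    ("inout", "input"): "inout",
    ("inout", "output"): "inout",
    ("inout", "inout"): "inout",
}


def retype(cable_type, subsignal_type):
    """Return the type of a subsignal after the enclosing cable use is typed."""
    return RETYPE[(cable_type, subsignal_type)]
```

In words: an input cable use keeps the subsignal types, an output cable use swaps input and output, and an inout cable use makes every subsignal inout.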

After a cable instance has been defined, each use of the instance name represents the 
list of its component signals in order. Each signal name in the list is a hierarchical name 
consisting of the cable instance name and the component signal name. Individual signals 
within the cable may be accessed by naming the signals hierarchically. 

Example: 

cable c2
    s1 input
    s2 output
end

cable c3
    cable c2 sc1 input
    cable c2 sc2 output
    cable c2 sc3 inout
end

In cable “c3”, signal “sc1.s1” would be input and “sc1.s2” would be output.
Because of retyping, signal “sc2.s1” would be output while “sc2.s2” would be
input. Both subsignals of “sc3” would be inout.


Example: 

If we make an instance “ci4” of the cable type “c1”, individual signals may be
referenced as “ci4.s1”, “ci4.s2”, “ci4.s3”, and “ci4.s4”. This set of signals, in
order, can be referenced simply as “ci4”.
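
The way a cable instance name stands for its ordered component signals can be sketched as a recursive flattening. In the Python sketch below, the table CABLES, the function flatten, and the instance name “x” are hypothetical; signal types and input arguments are left out. A nested cable use is written as a (cable name, instance name) pair.

```python
# Cable definitions from the examples, reduced to ordered member lists.
CABLES = {
    "c1": ["s1", "s2", "s3", "s4"],
    "c2": ["s1", "s2"],
    "c3": [("c2", "sc1"), ("c2", "sc2"), ("c2", "sc3")],
}


def flatten(cable, instance, cables=CABLES):
    """List the hierarchical signal names that a cable instance stands for, in order."""
    names = []
    for member in cables[cable]:
        if isinstance(member, tuple):
            # Nested cable use: prefix its signals with the nested instance name.
            sub_cable, sub_instance = member
            names.extend(flatten(sub_cable, f"{instance}.{sub_instance}", cables))
        else:
            names.append(f"{instance}.{member}")
    return names
```

Like the design tool, this sketch does not detect recursive cable references; a cable that uses itself would recurse without end.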


Cable definitions may use other cable definitions, including those which are not yet 
defined (forward referencing). There is no check for recursive cable references, which do not 
terminate. 


2.6 Module Definitions 

Two types of modules (primitive and composite) are used in circuit designs. Primitive mod- 
ules are objects with predefined functions. Composite modules define connections between 
primitive and composite modules. 

module_definition :=
    module string_token [ ( variable_list ) ] [ cost_section ]
        [ port_section ] [ signal_section ] [ component_section ] end


The string token following module is the module name. This name is used to reference 
the module in future use. As with cable definitions, when a module is used each input 
variable must be given a value. 

Additional module examples are given in appendix C. 


Cost section 

The cost section is used to estimate the relative expense of building modules using several
technologies. Each module definition instance has associated cost values. These costs may
be explicitly defined in the cost section, or may be implicitly defined as the sums of the
costs of its submodules. Primitive modules should define explicit costs with a cost section.
Composite modules should include a cost section if the hardware implementation of the
module does not correspond to the functional model represented by its subcomponents.

cost_section :=
    costs cost_pair_list

cost_pair_list :=
      cost_pair
    | cost_pair cost_pair_list

cost_pair :=
      nmos : arith_expr
    | cmos : arith_expr
    | gateInput : arith_expr

Currently there are three cost criteria: nmos, cmos, and gateInput. If a technology
cost is given more than once, the last cost pair is used.
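
A rough Python sketch of how explicit and implicit costs might combine is given below. The function name module_cost is hypothetical, and one point is an assumption: the manual does not say how a cost section that lists only some technologies is treated, so this sketch falls back to the sum of the submodule costs separately for each missing technology.

```python
TECHNOLOGIES = ("nmos", "cmos", "gateInput")


def module_cost(cost_pairs, child_costs):
    """Combine explicit cost pairs with the costs of child modules.

    cost_pairs  -- list of (technology, value) in the order written
    child_costs -- list of cost dictionaries, one per submodule
    """
    explicit = dict(cost_pairs)  # a repeated technology keeps its last value
    total = {}
    for tech in TECHNOLOGIES:
        if tech in explicit:
            total[tech] = explicit[tech]
        else:
            # Assumed per-technology fallback: sum over the submodules.
            total[tech] = sum(child[tech] for child in child_costs)
    return total
```

For instance, a primitive with pairs nmos: 4, cmos: 10, gateInput: 20, nmos: 6 ends up with an nmos cost of 6, and a composite without a cost section that contains two such primitives has an nmos cost of 12.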


Example: 

module m1(v1)
  costs
    nmos: 2*v1
    cmos: 10
    gateInput: 20
end

Port section 

The port section is an ordered list of the external connections of the current module. Ports 
are special signals which are used to connect to the module in later uses. A module with no 
ports cannot be referenced by another module. The order of port signals is important and 
determines proper connection of the module. 

port_section :=
    ports typed_signal_list

The typed signal list is the same as used in cable definitions, with the same subsignal 
retyping rules. 

Type information defines the proper use of the signal in the module and what connections 
are allowed if the module is referenced by a composite module. 

input implies that the signal is generated from an external source. 

output implies that the signal is generated within the current module. 

inout does not state the source of the signal. It causes the signal to be a 
(bi-directional) bus, which must be driven by tri-state drivers.

The following rules govern valid connections to each type of port signal within the current
module:

input: No output signal may be connected to the signal. At least one primitive descendent
module must use the signal as an input.

output: At least one primitive descendent module must use the signal as an output.

inout: At least one primitive descendent module must use the signal as an input or output.
Additionally, every connected output must be a tri-state driver (the signal is a bus).

Missing or inconsistently typed signal connections are reported upon creation of module 
definition instances. 


Example: 

module m2
  ports
    p1 input
    p2 output
    p3 inout
end

Signal section 

The signal section defines internal signals of the current module. These internal signals must
be distinct from port signals and may not be referenced by other modules. Every signal used 
in a module definition must be defined in either the port or signal section. The order in 
which internal signals are defined is not important. 

signal_section :=
    signals signal_name_list

Each signal defined in the signal section is given a special type of internal. If a cable 
use is defined in this section, all resulting signals are also typed as internal. 

The internal type means that the signal is both generated and used by primitive de- 
scendents of the current module. 


Example: 

module m3
  signals
    s1 s2 s3
end


Component Section

The component section determines how composite modules are built from other modules.
This is accomplished by ‘executing’ component statements in order, similar to conventional
programming languages. Primitive modules, whose functions are defined by C code, do not
use their component sections.

component_section :=
    components component_statement_list

component_statement_list :=
      component_statement
    | component_statement component_statement_list

component_statement :=
      submodule_statement     $ declare a child module
    | assign_statement        $ assign a value to a variable
    | join_statement          $ create a link between a group of signals
    | error_statement         $ print an error message
    | grouping_statement      $ group multiple statements
    | if_statement            $ execute statements conditionally
    | for_statement           $ execute a statement loop iteratively
    | while_statement         $ execute a statement loop conditionally
    | break_statement         $ exit from loops


Submodule Statement 

Submodule declares a module as a child of the current module. It also designates attach- 
ment of signals to the ports of the child module. 

submodule_statement :=
    object_name string_token [ ( argument_list ) ] hierarchical_name_list ;

The initial object name is the local name given to the submodule. This name is used to 
refer to the child module within the current module. Specifically, it is used in hierarchical
naming. The next string token is the name of a module definition. The number of arguments 
given must match the number of input variables of the module definition. Next is a list of 
signals to be attached to the ports of the child module. Because every signal must be 
declared in the port or signal section, references to cables use hierarchical names (and not 
cable uses). Each signal in this list will be connected to the corresponding port of the 
previously defined module in order. The signal list must be the same size as the number of 
ports of the previously defined module. The port and connecting signal must conform, using 
the rules stated under the port section. 


Example: 

module m4
  ports
    m4i input
    m4o output
  signals
    ic1
end

module m5
  ports
    m5i input
    m5o output
  components
    sc1 m4 m5i m5o;
end

In the example, module “m5” defines a child module of type “m4” and gives it
the local name “sc1”. The hierarchical name which refers to the signal “ic1” in
“m4” is “m5.sc1.ic1”. Note that the ports of “m4” and the connecting signals in
“m5” correspond in type.


Assign Statement 

Assign associates an integer value with a variable. The target of assign may be any unused 
variable name, or an assignment variable. Execution of assign causes the expression value 
to be computed and assigned to the variable. 

assign_statement :=
    string_token <- arith_expr ;

The string token names the variable to be assigned. 

Example: 

module m6
  components
    v1 <- 2;
    v2 <- v1 * 2;
    v1 <- 1;
end

The first assign creates a new variable “v1” with a value of 2. The second creates
“v2” and uses “v1” to compute “v2” as 2*2 = 4. The final assign changes “v1”
to 1, but does not affect “v2”.


Join Statement 

Join merges a set of signals to form a single signal. After a join has been completed, any
member can be used to represent the set in other component statements (including joins).

Every signal used in a join must be declared in the port or signal section of the module. 

The merging of signals caused by a join may introduce non-obvious inconsistencies in 
the connection of modules. These inconsistencies are reported upon execution of the join. 

join_statement :=
    join [ hierarchical_name_list ] ;

Note that the square brackets above do not indicate an optional argument. 


Example: 

module m7
  signals
    s1 s2 s3
  components
    join [s1 s2];
    join [s2 s3];
end

The first join merges signals “s1” and “s2”. The second merges the signal “s3”
with the signal which is the join of “s1” and “s2”.
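
The behaviour of join, where any member can later stand for the whole merged set, resembles a disjoint-set (union-find) structure. The Python sketch below uses that technique purely as an illustration; the manual does not say how the design tool implements join, and the class name SignalSets is hypothetical.

```python
class SignalSets:
    """Disjoint sets of signal names; each set models one merged signal."""

    def __init__(self, signals):
        # Every declared signal starts as its own representative.
        self.parent = {name: name for name in signals}

    def find(self, name):
        """Return the representative of the set containing name."""
        while self.parent[name] != name:
            # Path halving: point name at its grandparent, then step up.
            self.parent[name] = self.parent[self.parent[name]]
            name = self.parent[name]
        return name

    def join(self, names):
        """Merge all the named signals into a single set."""
        first = self.find(names[0])
        for other in names[1:]:
            self.parent[self.find(other)] = first
```

After join(["s1", "s2"]) and join(["s2", "s3"]), the three signals share one representative, mirroring the example above.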


Error Statement 

Error allows the user to print a message during the course of module generation. The 
message is a single string (no variables), and is designed mainly for identifying situations 
that should not occur. 

error_statement :=
    error string_token ;

Execution of error causes activation of an error message with the error flag mask acti- 
vated. The error mask value is given in appendix D. These messages may be suppressed or 
may cause program termination by options in the simrc file. 


Grouping Statement 

Grouping allows multiple component statements to act as a single statement lexically. This 
allows multiple statements to be used as targets in if, for and while statements. Grouping 
has no effect in other contexts.

grouping_statement :=
    { component_statement_list }

Note that there is no semicolon following the grouping statement.

Grouping does not affect the lexical scope of any variable.


If Statement


If allows conditional execution of a statement depending on the result of a logical expression. 
Multiple statements may be executed by the use of grouping. 

if_statement :=
    if log_expr component_statement [ else component_statement ] ;

If executes the first component statement when the logical expression is TRUE. When the 
logical expression is FALSE, the second component statement (in the optional else clause) 
is executed if available. 


For Statement 

For executes a statement a specified number of times. Multiple statements may be executed 
by the use of grouping. 

For evaluates two bounding expressions once to find the inclusive range for its loop
control variable. The target statement is then repeatedly executed with the loop variable
set to each value in the range. The loop variable is initially set to the value of the first
expression. If the first expression is less than the second expression, the loop variable is
incremented by one after each iteration; otherwise the variable is decremented by one.

The loop variable is not allowed to be an input or an assignment variable, and may not
be assigned within the loop. This guarantees termination of for.

for_statement :=
    for string_token = arith_expr , arith_expr component_statement


While Statement 

While executes a statement as long as a logical expression remains TRUE. Multiple state- 
ments may be executed by the use of grouping. 

While first evaluates the controlling logical expression. If it is TRUE, the target state- 
ment is executed, and while is reexecuted. If it is FALSE, execution continues at the 
statement immediately following the while. 

Termination of while is not guaranteed. There is no check for non-termination.

while_statement :=
    while log_expr component_statement


Break Statement 

Break is used to halt processing of for and while statements. Break disregards pending
statements in the current target component, and continues execution at the statement
immediately following the current for or while statement.

Break takes an argument which is the number of nested loops to break. A nonpositive 
argument has no effect. If the argument is larger than the number of nested loops, creation 
of the module is completed at the break. Break does not affect parent modules. 
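
The multi-level behaviour of break can be modelled with an exception that carries the number of loops still to be exited. This Python sketch is an illustration only, not the design tool's implementation; Break, do_break, run_for, and build_module are hypothetical names, and loop bodies are passed as Python functions.

```python
class Break(Exception):
    """Carries the number of nested loops that remain to be exited."""

    def __init__(self, levels):
        super().__init__(levels)
        self.levels = levels


def do_break(levels):
    # A nonpositive argument has no effect.
    if levels > 0:
        raise Break(levels)


def run_for(first, last, body):
    """Run body over the inclusive range, counting up or down as needed."""
    step = 1 if first < last else -1
    for value in range(first, last + step, step):
        try:
            body(value)
        except Break as exc:
            if exc.levels > 1:
                # Still more loops to leave: pass the rest outward.
                raise Break(exc.levels - 1) from None
            break


def build_module(body):
    """Execute a component section; an unconsumed break ends it early."""
    try:
        body()
    except Break:
        pass  # more levels than nested loops: module creation stops here
```

A break raised with an argument of 2 inside two nested loops leaves both of them and resumes after the outer loop, while a larger argument reaches build_module and ends creation of the module without disturbing the caller.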

break_statement :=
    break arith_expr ;


Chapter 3

Command Syntax 


The command syntax controls what actions are taken. These commands control the defini- 
tion and execution of circuit models. 

Commands are normally read from standard input. They may be directed from a file by 
using an input flag or a command statement. 


3.1 Filenames 

A special syntax is accepted to facilitate the use of filenames. Filenames are allowed to
be string tokens separated by periods ‘.’. This allows specification of most local filenames
without having to use quoted strings.

file_name :=
      string_token
    | string_token . file_name

Quoted strings must be used in order to use the UNIX directory structure. 


3.2 Current Generated Module 

The name of the last generated module to be referenced is saved. This is known as the 
current generated module. The current generated module is used when commands are issued 
which omit the optional module name. The current generated module is automatically set 
by generate, and may be changed using set.


3.3 Current Submodule


Each generated module has a single current submodule. The current submodule is used as 
a shorthand notation for a single submodule in each generated module. This allows simple 
reference to the submodule during testing. 

The current submodule is referenced by beginning a command name with the character ‘@’.

The current submodule of a simulation instance is initially the top level module generated. 
It may be changed using set. 


3.4 Parent Constructor 


The command naming syntax contains a parent constructor ‘~’. As each field is read in the
left-to-right expansion of a hierarchical name, the partial name corresponds to an object in 
the current module. When the parent constructor is read, the new object referenced by the 
partial name is set to the parent of the current object. 

The parent constructor is usually used in conjunction with the current submodule ‘@’.

Use of the parent constructor with an array of child modules may cause problems. 


3.5 Command Naming 

Names in the command syntax are similar to hierarchical names in definitions. There are 
additional rules which apply to command names: 

• The name of the current generated module or the current submodule identifier ‘@’ must
be the first field in the hierarchical name.

• There is a parent constructor ‘~’ which changes the target to the parent of the current
target.

We now define an object name in the command syntax. 

command_object :=
      @
    | string_token
    | @ . command_object_tail
    | string_token . command_object_tail

command_object_tail :=
      object_name
    | ~ . command_object_tail
    | object_name . command_object_tail

We also define a list of command objects defined by array expansion. This corresponds 
to the hierarchical list in the definition syntax. 

command_list :=
      @
    | string_token
    | @ . command_list_tail
    | string_token . command_list_tail

command_list_tail :=
      object_name
    | object_list
    | ~ . command_list_tail
    | object_name . command_list_tail
    | object_list . command_list_tail

Finally, a general list of command names is defined. Note that every command object is 
also a command list. 

command_object_list :=
      command_list
    | command_list command_object_list


3.6 Design Tool Commands


command_statement :=
      source_statement        $ read in a definition file
    | clear_statement         $ clear all current definitions
    | generate_statement      $ generate a module for simulation
    | run_statement           $ simulate a generated module
    | reset_statement         $ reset signals in a generated module
    | destroy_statement       $ destroy a generated module
    | read_statement          $ read commands from a file
    | close_statement         $ close an open command file
    | pause_statement         $ transfer control
    | assignment_statement    $ assign values to signals
    | show_statement          $ display signal vectors
    | showvector_statement    $ display signal vectors as a group
    | timed_statement         $ execute a command during simulation
    | set_statement           $ set options
    | repeat_statement        $ loop read a command file
    | quit_statement          $ exit the program


Source Statement 

Source reads in a file of module definitions. The entire file is read using the stated definition 
language rules. Module syntax is checked as the definition file is read. Definitions are checked 
for consistency only when referenced by a generate. 

source_statement :=
    source file_name ;

If source causes a module to be redefined, the new definition will be used only in future 
module definition instances. Previously defined instances will continue to use the previous 
definition. 


Clear Statement 

Clear deletes all definitions and simulation instances. This is equivalent to restarting the 
design tool. 

clear_statement :=
    clear ;

Generate Statement 

Generate creates a simulation instance of a module definition. The module should have been
previously read using a source. The simulation model is used to test correctness of designs.
Generating a module causes all consistency checks on submodule use to be performed. The 
consistency rules have been stated along with the definition syntax. 

Each generated module is set to a special initial state in which all signals have an unde- 
fined value. 

Generation of a module causes it to become the current generated module.

generate_statement :=
    generate string_token [ ( argument_list ) ] ;

The string token specifies the module to be generated. The arguments given must match 
the number of input variables of the module. 


Run Statement 

Run executes a simulation run of a generated module. Events queued for the module are 
evaluated until all events have been processed or a halt command has been issued. Run is 
detailed in section 4. 

run_statement :=
    run [ string_token ] ;

The string token specifies the generated module to be run. If omitted, the current 
generated module is run. 

A simulation run may be aborted by an interrupt signal (control-C). Such an interrupt 
sets command input to the interactive level, or exits the design tool if it is being run in batch 
mode. Aborting a simulation run does not affect pending events. 


Reset Statement 

Reset causes a generated module to be set to its special initial state. All signals in the 
simulation instance are set to the unknown value and all events are removed from the event 
queue. 

reset_statement :=
    reset [ string_token ] ;

The string token specifies the generated module to be reset. If omitted, the current 
generated module is reset. 
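
The interplay of queued events, run, and reset can be pictured with a very small model. The Python sketch below is an assumption-laden illustration: the class GeneratedModule and its methods are hypothetical, an event is reduced to a pending (signal, value) pair, and the propagation of values through primitive modules during a run is left out entirely.

```python
from collections import deque

UNKNOWN = None  # stand-in for the unknown signal value of the initial state


class GeneratedModule:
    def __init__(self, signal_names):
        self.signals = {name: UNKNOWN for name in signal_names}
        self.events = deque()

    def assign(self, name, value):
        """Queue an event; the signal does not change until the next run."""
        self.events.append((name, value))

    def run(self):
        """Process queued events until the queue is empty."""
        while self.events:
            name, value = self.events.popleft()
            self.signals[name] = value

    def reset(self):
        """Return to the initial state: all signals unknown, no pending events."""
        self.signals = {name: UNKNOWN for name in self.signals}
        self.events.clear()
```

The point of the sketch is the ordering: a queued change becomes visible only after run, and reset discards anything still waiting in the queue.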


Destroy Statement 

Destroy frees a generated module which is no longer needed. Destroying a module does not
affect any other generated modules or any definitions.

destroy_statement :=

destroy [ string_token ] ;

The string token specifies the generated module to be destroyed. If omitted, the current 
generated module is destroyed. 

If the current generated module is destroyed, it will be ill-defined until reset by a set or
generate. 


Read Statement 


Read causes commands to be read from a file. The file is read until the end-of-file is 
reached, or a pause is executed in the file. Commands are again read from the current 
source following exit from the named file. 

read_statement :=

read file_name ;

The file is closed after a read if the entire file has been read. If the named file is already 
open, read continues at the current position in the file. 

Close Statement 

Close closes a file left open by a previous read. This allows a file containing a pause to be 
reread from the beginning of the file. 

close_statement :=

close file_name ;

Pause Statement 

Pause stops reading of the current source of command input. Command input is then read 
from the previous source. 

pause_statement :=

pause ; 

A pause in a command file causes reading of the file to stop. The file is kept open, and 
a subsequent read will continue at the command following the pause. In the interactive 
(top) level, pause exits the design tool. 

Assignment Statement 

Assignment sets a signal value in the current generated module. Assignment events are
put into the event queue of the current generated module. These events are completed on
the next run of the module.

assignment_statement :=

command_signal_list <- numeric_token ;

The command signal list has been described under command naming.

The numeric token may be a binary, octal, hexadecimal, or decimal number. In all cases,
the number is converted to a binary representation, then assigned in order with the least
significant bit assigned to the rightmost signal. A 0 bit corresponds to logical low, while
1 corresponds to logical high. Only the logical high and low values may be assigned.

If the numeric value is binary, octal, or hexadecimal, the number of bits given must
‘match’ the number of signals to be assigned. This means that exactly the minimum number
of data bits needed to assign a value to every signal must be given.

Assignment of a value to a bus is not recommended. Assignment of decimal values to 
signal lists longer than 32 bits is not supported. 
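
The bit-to-signal mapping described above can be sketched in Python. This is an illustrative model only, not the design tool's own code: the function name assign_bits, the representation of a signal list as a list of names, and the error raised for a value that does not fit are all assumptions.

```python
def assign_bits(signals, value):
    """Map an integer onto a signal list, least significant bit rightmost.

    `signals` is a list of signal names (a hypothetical representation).
    The result maps each name to 0 (logical low) or 1 (logical high).
    """
    if value < 0 or value >= 1 << len(signals):
        # Assumed behaviour: reject values that need more bits than signals.
        raise ValueError("value does not fit the signal list")
    result = {}
    # Walk the list from the right so that bit 0 lands on the last signal.
    for position, name in enumerate(reversed(signals)):
        result[name] = (value >> position) & 1
    return result


if __name__ == "__main__":
    # Three signals a, b, c receive the binary value 110.
    print(assign_bits(["a", "b", "c"], 0b110))
```

With the signal list a b c and the value 0b110, the rightmost signal c receives the least significant bit.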


Show Statement 

Show prints the value of a list of signals in a generated module. Each signal is printed 
individually, giving the signal value and time of last change. 

A signal may have the following values:

0    logical low
1    logical high
U    undefined
X    bad signal value
T    tri-state value

show_statement :=

show command_signal_list ;

Show works differently when used with a bus. If a primitive output port onto the bus 
is named, the value of the port is given, otherwise the computed bus value is printed. This 
enables all inputs to a bus to be printed, as well as the bus value. 


Showvector Statement 

Showvector prints a numeric equivalent of the signal values for a list of signals. The list of 
values is interpreted as a binary number, with the least significant bit corresponding to the 
rightmost element. Logical low corresponds to a 0 bit, while logical high corresponds to a
1. This is consistent with assignment rules for signals. The resulting composite value is 
printed as a decimal number. If any signal has an abnormal value (not logical low or high), 
each signal is printed individually using show. 

showvector_statement :=

showvector command_signal_list ;

Showvector does not support signal lists containing more than 32 elements.
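
The conversion performed by showvector can be modelled in a few lines of Python. This is a sketch under assumptions: the function name showvector and the use of one-character strings for signal values are illustrative choices, and returning None stands in for the fallback to printing each signal with show.

```python
def showvector(values):
    """Return the decimal value of a list of signal values.

    `values` holds one-character strings ('0', '1', 'U', 'X', 'T'), leftmost
    signal first. None is returned when any value is abnormal, which is the
    case where the tool prints each signal individually instead.
    """
    if len(values) > 32:
        raise ValueError("signal lists longer than 32 elements are not supported")
    if any(v not in ("0", "1") for v in values):
        return None
    number = 0
    for v in values:
        # The leftmost element is the most significant bit.
        number = (number << 1) | int(v)
    return number


if __name__ == "__main__":
    print(showvector(["1", "1", "0"]))
    print(showvector(["1", "U", "0"]))
```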

Timed Statement 

Timed statements store commands in the event queue of a generated module for execution 
during a run. Only three types of statements may be used as timed statements: assignment, 
show and pause. Timed statements have an initial argument which is the amount of 
simulation time to pass before the statement is executed. They are put into the event queue 
for execution. 

timed_statement :=

numeric_token : assignment_statement
| numeric_token : show_statement
| numeric_token : pause_statement

pause functions differently when used as a timed statement. A timed pause halts the 
current simulation run, returning control to the command level which initiated the run (not 
necessarily the level which produced the pause). This is similar to halting a run via an 
interrupt. The other statement types function normally. 

Timed statements compute their target signals before being entered in the event queue.
The present value of the current submodule is used for decoding ‘@’.


Set Statement 

Set is used to change values used by the design tool. There are two things which may be 
changed with set: the current generated module, and the current submodule (of the current 
generated module). 

The current generated module is the default used when certain operations do not specify
a module name. The current submodule is used as the initial path object in command names
which have ‘@’ as the initial field.

set_statement :=

set simulation string_token ;

| set @ command_object ;

In the first variation, the string token refers to a simulation instance. This instance
becomes the current generated module.

In the second, the new current submodule ‘@’ is specified by the command object. The
command object must be a module, not a signal or cable. The previous value of ‘@’ may be
used to specify the new object.


Repeat Statement 

Repeat causes repetitive reading of a command file until a test passes. It reads a signal
value as a test for completion once each time the file is read.

There are two versions of repeat: while and until. 

While checks the signal before reading the file. Execution continues as long as
the signal value is logical high. 

Until reads the file before checking the test signal. It continues execution as 
long as the value is not logical high. Until always reads the file at least once. 

Note that the two versions use inverse testing conditions.

repeat_statement :=

repeat file_name while command_signal ; # check before loop

| repeat file_name until command_signal ; # check after loop

If multiple tests are needed for the halting condition, the halting function must be de- 
signed in hardware. 

There is no check for termination of repeat statements. 
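
The difference between the two forms can be made concrete with a small Python model. It is a sketch only: read_file and signal_is_high are hypothetical callbacks standing in for reading the command file and sampling the test signal, and the returned pass count is added purely for illustration.

```python
def repeat_while(read_file, signal_is_high):
    """Model of `repeat file while signal`: test first, loop while high."""
    passes = 0
    while signal_is_high():
        read_file()
        passes += 1
    return passes


def repeat_until(read_file, signal_is_high):
    """Model of `repeat file until signal`: read first, stop once high."""
    passes = 0
    while True:
        read_file()
        passes += 1
        if signal_is_high():
            return passes


if __name__ == "__main__":
    counter = {"reads": 0}

    def read_file():
        counter["reads"] += 1

    # The signal is already high, so `until` still reads the file once.
    print(repeat_until(read_file, lambda: True))
    # The signal is low, so `while` never reads the file.
    print(repeat_while(read_file, lambda: False))
```

As in the tool, neither loop has a termination check: if the signal never reaches the required value, the loop runs forever.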

Quit Statement 

Quit causes normal termination of the design tool. No state is retained between executions.

quit_statement :=

quit ;


Chapter 4

Implementation Details 


A simulation run iteratively executes primitive modules affected by changes to their input 
signals, then updates the value of their output signals. This continues until the simulation 
instance reaches a steady state, or a halt command is processed. 

Each event in a simulation instance has an associated integer processing time. Events 
with the same processing time are completed in a single time step, and are processed before 
any event with a greater processing time. The last processing time executed is known as the 
current processing time. Simulating in time steps allows the current processing time to serve 
as an indicator of the amount of time a circuit takes to execute. 

Following are specific implementation details of the design tool: 


4.1 Primitive Modules 

Primitive modules perform functions predetermined by C code. These modules have a
uniform delay characteristic δ ≥ 1, meaning that a change on any input of a module causes a
change in its outputs exactly δ time units in the future.

The delay characteristic must be positive to satisfy the processing time requirement. 

Uniformity ensures consistency in the output of a primitive module. Uniformity is needed 
because the simulation model does not throw out events. If the delay characteristic were
nonuniform, a single module could schedule signal value changes on the same wire out
of order.


4.2 Simulation Construction 

To speed simulation, generated modules are flattened. Flattening removes the definition
hierarchy from a simulation instance. Only instances of primitive modules and connections
between them remain after flattening. This speeds execution, since the definition hierarchy
is not traversed during simulation. Flattening constructs connection lists for each signal that
specify which primitive instances affect it and are affected by it. 

The definition hierarchy is retained and is used to reference the flattened structure. 


4.3 Bus Signals 

Bus signals, which are driven by tri-state drivers, are built in a special way. Each primitive 
module on a bus writes to a specific entry point, similar to a port of a module. The bus 
value is calculated based on the values of its entry points. Every primitive module reading 
from the bus gets the calculated bus value. 

Busses also have special handling for printing. A bus name which corresponds to an 
output of a primitive module prints information about the corresponding entry point. Any 
other name corresponding to the bus prints information about the calculated bus value. This 
allows for easier examination of busses. 
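
A bus value computed from its entry points might look like the following Python sketch. The manual does not state the resolution rule, so everything here is an assumption made for illustration: an undriven bus stays tri-state, any bad or undefined driver dominates, and two drivers that disagree produce the bad value.

```python
def bus_value(entry_points):
    """Compute a bus value from the values written to its entry points.

    `entry_points` is a list of one-character signal values
    ('0', '1', 'U', 'X', 'T'). The rule below is assumed, not documented.
    """
    driven = [v for v in entry_points if v != "T"]
    if not driven:
        return "T"          # nothing drives the bus
    if "X" in driven:
        return "X"          # bad values propagate
    if "U" in driven:
        return "U"          # undefined values propagate
    if len(set(driven)) > 1:
        return "X"          # conflicting drivers (assumed to be an error)
    return driven[0]


if __name__ == "__main__":
    print(bus_value(["T", "1", "T"]))   # one active driver
    print(bus_value(["0", "1"]))        # two drivers in conflict
```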


4.4 Simulation Events 

In order to satisfy the processing time requirement, events are stored in and read from a 
priority queue. This is implemented in the design tool by a heap. 

The queue contains three types of events: signal value, printing and halting. 

• Signal value events specify changes in the value of a signal. These events cause affected 
primitive modules to be executed. 

• Printing events cause printing of signal information. 

• Halting events stop execution of a simulation run following the current time step, 
instead of waiting until stable state. 

Events may be created by a command statement, or as an effect of executing a primitive 
module. 


4.5 Simulation Runs 

Each simulation run reads and processes events until all events have been processed or a 
halting command has been processed. 

Each time step of the run is conducted in phases. 

1. All current events are extracted from the priority queue.

• Signal value events cause the target signal to be immediately updated. Each
update causes connected busses and primitive modules to be scheduled for eval- 
uation. Modules and busses are kept in separate evaluation lists. If a signal is 
updated more than once in a single time unit, an error message is printed. 

• Printing events get stored in a list for later processing. 

• A halting command sets a flag to exit the simulation run following the current 
time unit. 

2. Each bus scheduled in the first phase is evaluated, based on the values of all signals
connected to it. This may schedule additional primitive modules for evaluation.

3. Each primitive module in the evaluation list is processed. The C code for each affected 
module is executed. This may change internal state and may schedule additional 
simulation events. Because of the delay characteristic of primitive modules, events are 
always scheduled for a later processing time. 

4. Printing commands are executed. This shows the signal state at the end of the current 
processing time. 

After these phases are completed, the simulation stops if the halting flag is set. Otherwise, 
the next time step is processed. 
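
The phases above can be sketched as a small event loop in Python, using the standard heapq module as the priority queue. This is a simplified model and not the design tool's implementation: the class and function names are hypothetical, bus evaluation (phase 2) is left out, the race-condition check is omitted, and primitives are modelled as plain Python callables rather than C code.

```python
import heapq
import itertools


class Simulation:
    """Toy model of a simulation instance (names are illustrative)."""

    def __init__(self):
        self.queue = []                 # heap of (time, seq, kind, payload)
        self.seq = itertools.count()    # tie-breaker keeps insertion order
        self.signals = {}               # signal name -> value
        self.readers = {}               # signal name -> primitives reading it
        self.now = 0                    # current processing time
        self.log = []                   # output of printing events

    def schedule(self, time, kind, payload=None):
        heapq.heappush(self.queue, (time, next(self.seq), kind, payload))

    def run(self):
        while self.queue:
            self.now = self.queue[0][0]
            to_evaluate, to_print, halt = [], [], False
            # Phase 1: extract every event of the current time step.
            while self.queue and self.queue[0][0] == self.now:
                _, _, kind, payload = heapq.heappop(self.queue)
                if kind == "signal":
                    name, value = payload
                    self.signals[name] = value
                    for prim in self.readers.get(name, []):
                        if prim not in to_evaluate:
                            to_evaluate.append(prim)
                elif kind == "print":
                    to_print.append(payload)
                elif kind == "halt":
                    halt = True
            # Phase 2 (bus evaluation) is omitted in this sketch.
            # Phase 3: run affected primitives; they schedule later events.
            for prim in to_evaluate:
                prim(self)
            # Phase 4: print signal state at the end of the time step.
            for name in to_print:
                self.log.append((self.now, name, self.signals.get(name, "U")))
            if halt:
                break


def make_inverter(a, x, delay=1):
    """Build an inverter primitive reading signal `a` and driving `x`."""
    def evaluate(sim):
        value = sim.signals.get(a, "U")
        out = {"0": "1", "1": "0"}.get(value, value)
        sim.schedule(sim.now + delay, "signal", (x, out))
    return evaluate


if __name__ == "__main__":
    sim = Simulation()
    sim.readers["a"] = [make_inverter("a", "x")]
    sim.schedule(0, "signal", ("a", "1"))
    sim.schedule(1, "print", "x")
    sim.run()
    print(sim.log)
```

Because every primitive has a delay of at least one time unit, phase 3 can only add events for later time steps, so the inner loop of phase 1 always sees a complete set of events for the current step.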
Chapter 5


Startup Options 


The design tool has a number of options which are set at the beginning of execution. They 
are separate from command statements and do not change during execution. These options 
control general input and output characteristics of the design tool. 

Normally commands are read from stdin and output is written to stdout. Error mes- 
sages are directed to stderr. Input and output may be redirected using execution arguments. 
Error messages may not be redirected. 


5.1 Command-line Arguments 

A number of options may be set on the command line:

sim [ option_list ]

Acceptable command line options are: 

-i <filename> Read commands from the named file instead of stdin. This causes batch 
mode execution, rather than interactive. 

-o <filename> Direct output messages to the named file instead of stdout. Output 
messages result from command statements, specifically the printing statements (show 
and showvector). An output file should only be specified when in batch mode (-i).

-n Turn off debugging messages. Debugging messages are useful in verifying a circuit design. 
Debugging causes the design tool to print additional information about each created 
circuit and signal information each time a signal changes value.


5.2 simrc File


Upon startup, additional information is read from a file named simrc, which should be in 
the current directory upon program execution. 

Comments in simrc are specified by the pound sign ‘#’, as in the other syntax rules.
Numbers in simrc are interpreted using the C strtol function, so they do not conform to
the conventions used in other parts of the design tool.

There are four options which may be specified in simrc: 


Debugging Messages 

Debugging messages may be suppressed with the single keyword no_print. This produces
the same result as the -n command-line argument. 


Fanout 

Fanout is a crude measure of the drive/load ratio on signals. Signals with large numbers of 
inputs or outputs are more likely to have load problems. A rough estimate of signal load is
produced by comparing the number of inputs and outputs of each signal to a user-specified
number. A warning message is printed for each signal which has a fan-in or fan-out greater 
than the fanout value. 

The fanout value is specified with the keyword max_fan, followed by an integer. The 
number should use C syntax. 


Error Printing 

Error messages may be suppressed by specification of an error printing mask. Only errors
specified by the mask get printed. The list of error types and their corresponding mask 
numbers are shown in appendix D. 

The error printing mask is specified with the keyword print_mask, followed by an integer. 
The number should use C syntax. 


Error Halting 

Execution of the design tool may be halted by use of an error halting mask. Encountering 
an error specified by the mask causes the design tool to exit. The list of error types and 
their corresponding mask numbers are shown in appendix D.

The error halting mask is specified with the keyword halt_mask, followed by an integer.
The number should use C syntax.

Any error specified by the halting mask is always printed before exiting, even if it is not
specified for printing (print_mask).
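
A reader for the four simrc options could be sketched as follows in Python. Several points are assumptions rather than documented behaviour: keywords and numbers are taken to be separated by white space, strtol is assumed to be called with base 0 (so a leading 0x means hexadecimal and a leading 0 means octal), and the function names parse_c_int and parse_simrc are illustrative.

```python
def parse_c_int(text):
    """Interpret a non-negative number the way C strtol does with base 0."""
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)
    if lowered.startswith("0") and len(lowered) > 1:
        return int(lowered[1:], 8)
    return int(lowered, 10)


def parse_simrc(text):
    """Return the options found in the contents of a simrc file."""
    options = {"no_print": False, "max_fan": None,
               "print_mask": None, "halt_mask": None}
    tokens = []
    for line in text.splitlines():
        # Everything after '#' on a line is a comment.
        tokens.extend(line.split("#", 1)[0].split())
    token_iter = iter(tokens)
    for token in token_iter:
        if token == "no_print":
            options["no_print"] = True
        elif token in ("max_fan", "print_mask", "halt_mask"):
            options[token] = parse_c_int(next(token_iter))
        else:
            raise ValueError(f"unknown simrc keyword: {token}")
    return options


if __name__ == "__main__":
    example = """
    no_print            # suppress debugging messages
    max_fan 8
    print_mask 0177     # octal, C syntax
    halt_mask 0x40      # hexadecimal, C syntax
    """
    print(parse_simrc(example))
```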
Appendix A
The Lexer 


A lexer is used to convert input into tokens. 

The lexer recognizes four primary types of tokens: 
single character tokens 
string tokens 
reserved words 
numeric tokens 

The lexer uses spacing characters (space, tab, newline) to separate tokens, but they are
not passed along. 


Comments 

The character ‘#’ is used to signify a comment. When a comment character is read, the
remainder of the input line (until the next newline) is disregarded. Commenting does not 
work within a quoted string. 

Single Character Tokens 

The single character tokens recognized by the lexer are:

(9 C .1 i .1 < I 5 ‘< 7 *>’ 1 C 

> 5 * 7 >> ’ ’ ’ ’ ’ V 7 

T, T, T, T, ‘>\ ‘-\ ‘ v , 

7\ ‘0’, T, 

Single character tokens do not need to be separated from other tokens by spacing char- 
acters. 

Non-alphanumeric characters which are not single character tokens or one of the special
characters ‘"’, ‘#’, and ‘_’ are disregarded.


String Tokens


The lexer recognizes two types of string tokens: quoted and unquoted. 

An unquoted string consists of an initial alphabetic character or underscore _ followed 
by any number of alphanumeric characters or underscores. Unquoted strings are checked 
against the list of reserved words. If an unquoted string matches a reserved word, it is 
passed to the simulator as the reserved word token. 

A quoted string is a succession of characters enclosed within two delimiting quote symbols
(‘"’). Quoted strings allow acceptance of strings which do not qualify as unquoted strings. This
is used for filenames and message printing. A quoted string may not cross a line boundary. 
Quoted strings are not checked against reserved words, so they are always passed as string 
tokens. 

Here is the string syntax given as regular expressions: 
unquoted_string := [a-zA-Z_][a-zA-Z_0-9]*
quoted_string := "?*"

In the regular expressions, square brackets denote a choice between characters. ‘?’ represents
any single character. ‘*’ means a sequence of zero or more of the previous character
or choice of characters.

There is currently no way to pass a string containing the newline character. 


Reserved Words 

Reserved words are strings which have special meaning in the design tool. Each unquoted 
string read by the lexer is checked against the list of reserved words. If a string matches a 
reserved word, it is passed as the reserved word. 

There are two categories of reserved words. The first is used when reading definitions, 
the other when reading commands. 

Reserved Definition Words: 


break 

cable 

cmos 

components 

cost 

else 

end 

error 

for 

gateInputs

if 

inout 

input 

join 

module 

nmos 

output 

ports 

signals 

ts_inout

ts_output

while 




Reserved Command Words: 




clear 

close 

destroy 

generate 

pause 

quit 

read 

repeat 

reset 

run 

set 

show 

showvector 

simulation 

source 

until 

while


Numeric Tokens


Four types of numeric tokens are recognized by the lexer: binary, octal, decimal, and hexadecimal.
These correspond to numbers in base 2, 8, 10, and 16 respectively.

Binary, octal, and hexadecimal numbers have ‘0’ as their initial character. The second 
character specifies the base of the number. 

• ‘b’ or ‘B’ specifies a binary number. This is followed by a sequence of the characters
‘0’ or ‘1’.

• ‘o’ or ‘O’ specifies an octal number. This is followed by a sequence which may contain
characters corresponding to the numbers 0-7.

• ‘x’ or ‘X’ specifies a hexadecimal number. This is followed by a sequence which may
contain characters corresponding to the numbers 0-9 or alphabetic characters in the
range a-f (upper or lower case). The characters a-f represent the decimal values 10-15
respectively.

If the second character does not fall into the above categories or if the leading character
is a number which is not ‘0’, the numeric token is a decimal number. A decimal number is
a sequence of characters, each of which corresponds to a number in the range 0-9.

Each syntax is repeated below as a regular expression:

binary_number := 0[bB][01]*

octal_number := 0[oO][0-7]*

hexadecimal_number := 0[xX][0-9a-fA-F]*

decimal_number := [0-9][0-9]*

In the regular expressions, square brackets denote a choice between characters. ‘*’ means 
a sequence of zero or more of the previous character or choice of characters. 

All types of numeric tokens are interpreted as having the most significant digit on the 
left. 
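
The numeric token rules can be tried out with Python's re module. The sketch below is not the lexer's code. It assumes that octal digits are the usual 0-7, that a prefix with no digits after it (allowed by the ‘*’ in the grammar) has the value zero, and the function name numeric_token is illustrative.

```python
import re

# One alternative per token type; the prefixed forms are tried first.
NUMERIC = re.compile(
    r"0[bB](?P<bin>[01]*)"
    r"|0[oO](?P<oct>[0-7]*)"
    r"|0[xX](?P<hex>[0-9a-fA-F]*)"
    r"|(?P<dec>[0-9]+)"
)


def numeric_token(text):
    """Return (value, base) for a numeric token, most significant digit left."""
    match = NUMERIC.fullmatch(text)
    if match is None:
        raise ValueError(f"not a numeric token: {text!r}")
    for group, base in (("bin", 2), ("oct", 8), ("hex", 16), ("dec", 10)):
        digits = match.group(group)
        if digits is not None:
            # An empty digit sequence is treated as zero (an assumption).
            return (int(digits, base) if digits else 0), base
    raise AssertionError("unreachable: one alternative always matches")


if __name__ == "__main__":
    for token in ("0b101", "0o17", "0xFf", "42", "0"):
        print(token, numeric_token(token))
```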
Appendix B

Primitive Modules 


This appendix contains the current list of predefined primitive modules. Each primitive is
shown as a module definition, with associated costs, and is accompanied by a short descrip- 
tion. 

Improper input values to primitive modules cause uncertain results to occur. These 
results should not be relied upon. The following rules generally apply: 

• Bad signal values propagate.

• If no bad signals are present, undefined signals propagate. 

• If no bad signals are present, tri-state signals cause undefined output. 


Because primitive modules are specially defined, some of their functions cannot be re- 
produced by general composite modules. 
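
As an illustration of the general rules for improper inputs, the following Python sketch applies them to a two-input NAND. It is only a model of the stated rules, which the manual says should not be relied upon. The function name nand and the use of one-character strings for signal values are assumptions, and the real primitive may treat individual cases differently.

```python
def nand(a, b):
    """Two-input NAND over the signal values '0', '1', 'U', 'X', 'T'."""
    inputs = (a, b)
    if "X" in inputs:
        return "X"      # bad signal values propagate
    if "U" in inputs or "T" in inputs:
        return "U"      # undefined propagates; tri-state gives undefined
    return "0" if a == "1" and b == "1" else "1"


if __name__ == "__main__":
    for pair in (("1", "1"), ("0", "1"), ("1", "U"), ("T", "0"), ("X", "U")):
        print(pair, nand(*pair))
```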


Constant 

const allows signals to be hooked to a constant source. The input argument becomes the 
source value. Valid argument values are ‘0’ (logical low) and ‘1’ (logical high). Use of other
values is not recommended. 

Constant values cause attached modules to execute on the first run following module 
generation and after a simulation instance has been reset. 

module const (v)
    # the constant has zero costs
    cost nmos: 0 cmos: 0 gateInputs: 0
    ports
        v output
end


Inverter

inv does a logical inversion of the input signal. Valid input values for “a” are logical low
and high.

• If “a” is low, “x” is set to high.

• If “a” is high, “x” is set to low.


module inv
    cost nmos: 2 cmos: 2 gateInputs: 1
    ports
        a input
        x output
end


Logical NAND 

nand computes the logical NAND of the input signals. Valid input values for the inputs “a”
and “b” are logical low and high.

• If either signal is low, “x” is set to high.

• If both signals are high, “x” is set to low.


module nand
    cost nmos: 3 cmos: 4 gateInputs: 2
    ports
        a input
        b input
        x output
end


Logical NOR 

nor computes the logical NOR of the input signals. Valid input values for the inputs “a”
and “b” are logical low and high.

• If either signal is high, “x” is set to low.

• If both signals are low, “x” is set to high.


module nor
    cost nmos: 3 cmos: 4 gateInputs: 2
    ports
        a input
        b input
        x output
end


Delay 

delay simply propagates a signal value with a time delay. The output signal is set to the 
input signal, regardless of the value. The input argument is the time to delay the output, 
which must be a positive number. 

The cost of a delay is represented as a pair of inverters.

module delay (delta)
    cost nmos: 4 cmos: 4 gateInputs: 1
    ports
        d input
        q output
end


Transmission Gate 

trans_gate sets the output “q” to the value of the input “d” when enabled with the enable
signals “e1” and “e2”. When not enabled, “q” is set to the tri-state value. The transmission
gate is a dual-rail model, which means “e1” should always be the logical inverse of “e2”.

• When “e1” is high (“e2” is low), “q” gets the value of “d”.

• When “e1” is low (“e2” is high), “q” gets the tri-state value.

“q” must be hooked to a bus signal. This means that all ports which output to the bus
must be typed as tri-state. In particular, only trans_gate outputs and SRAM data lines
may output to the same signal as “q”.

module trans_gate
    cost nmos: 1 cmos: 2 gateInputs: 2
    ports
        d input
        e1 input
        e2 input
        q ts_output
end


Positive Latch

posLatch is a single bit of non-volatile memory. It uses “l” to control when data is read
into memory from “d”. The value in memory is output through “Q”.

• When “l” is low, the latch holds state.

• When “l” is high, the memory value (and “Q”) is set to the value of “d”.

module posLatch
    cost nmos: 8 cmos: 10
    ports
        d input
        l input
        Q output
end


Negative Latch 

negLatch is a single bit of non-volatile memory. It uses “lb” to control when data is read
into memory from “d”. It is called a negative latch (as opposed to positive latch) because
the sense of the latch signal “lb” is reversed. The value in memory is output through “Q”.

• When “lb” is low, the memory value (and “Q”) is set to the value of “d”.

• When “lb” is high, the latch holds state.


module negLatch
    cost nmos: 8 cmos: 10
    ports
        d input
        lb input
        Q output
end


Static RAM 

SRAM is memory for simulation instances. A static RAM module takes two arguments:
the amount of memory and the number of bits in the word. It reads and stores data in
addressable memory based on its control signals “rw” and “e”. “e” enables the RAM for an
operation, and “rw” selects whether the operation reads from or writes to memory.

• If “e” is low, the memory does nothing.

• If “e” is high, the memory does the specified operation.

• If “rw” is low, the operation is a write.

• If “rw” is high, the operation is a read.

The signals in “D” must be hooked to busses. This means all ports which output to each
bus must be typed as tri-state. In particular, only trans_gate outputs and other SRAM
data lines may output to those busses.

There is currently no way to preload data into the memory. All data must be written to
memory before it is used.

Data widths larger than 32 bits are not supported.

module SRAM (amount, width)
    ports
        rw input
        e input
        A[1:amount] input
        D[1:width] ts_inout
end


Dynamic Memory Test 

D_test is used to simulate dynamic RAM in conjunction with the static RAM module 
SRAM. It keeps track of the last time data was written to the address, however D_test 
does not actually store the data. If data is used too long after it has last been written, an 
error message is generated. 

The “rw” and “e” lines work as described for the static RAM.

module D_test (amount)
    ports
        rw input
        e input
        A[1:amount] input
end


Appendix C
Module Examples 


This section contains two simple examples which demonstrate certain features in the defini- 
tion language. The examples have not been optimized. 

The first example is a scalable multi-input OR. It is constructed using a two-input OR, 
which in turn is built from primitives nor (logical NOR) and inv (inverter). The multi-input 
OR uses recursion to construct a collection tree. This results in an O(log(k)) running time
as opposed to O(k) time for a chain.

module two_input_OR
    ports
        x y input
        z output
    signals
        z_bar
    components
        xy_nor nor x y z_bar;
        z_comp inv z_bar z;
end

module k_input_OR(k)
    # compute a multi-input OR by recursion
    ports
        x[1:k] input
        z output
    signals
        z1 z2    # internal signals for split
    components
        if {k = 1} {
            join [ x[1] z ];    # connect input to output
            break (1);          # end current module; halt recursion
        }

        # compute split information
        k1 <- k/2;
        k2 <- k - k1;

        # split in half and recurse on both parts.
        z1_comp k_input_OR(k1) x[1:k1] z1;
        z2_comp k_input_OR(k2) x[k1+1:k] z2;

        # recombine parts
        z_comp two_input_OR z1 z2 z;
end


This example uses recursion to split the tree into two subtrees, and a two_input_OR to
recombine the subtrees. Recursion is halted when the subtree has only a single input. This
is done by using break to end the module definition.
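
The logarithmic depth of the collection tree can be checked with a short Python model of the same recursion. This is a sketch rather than part of the design tool: the function name or_tree_depth is hypothetical, and k/2 in the definition language is assumed to be integer division.

```python
def or_tree_depth(k):
    """Count two_input_OR levels on the longest path of k_input_OR(k)."""
    if k == 1:
        return 0            # join: the input is wired straight to the output
    k1 = k // 2             # assumed to mirror `k1 <- k/2`
    k2 = k - k1
    # One two_input_OR recombines the two subtrees.
    return 1 + max(or_tree_depth(k1), or_tree_depth(k2))


if __name__ == "__main__":
    for k in (1, 2, 3, 8, 9, 1000):
        print(k, or_tree_depth(k))
```

Under these assumptions the depth equals the ceiling of log2(k), whereas a chain of two-input OR modules would need k - 1 levels.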

The second example is a variable length MIN circuit. It uses a for loop to join a chain 
of single-bit MIN modules. 

module two_input_AND
    ports
        x y input
        z output
    signals
        z_bar
    components
        xy_nand nand x y z_bar;
        z_comp inv z_bar z;
end

module a_gre_b
    # set z to one if a is greater than b ((a = 1) & (b = 0))
    ports
        a b input
        z output
    signals
        b_bar
    components
        b_inv inv b b_bar;
        z_comp two_input_AND a b_bar z;
end


module MIN
    # compute MIN based on input values and selection inputs
    # compute selection outputs for chaining
    #
    # (sxi = 1) -> choose x as the MIN
    # (syi = 1) -> choose y as the MIN
    # sxi = syi = 1 is an impossible state
    ports
        x y input
        z output
        sxi syi input     # select control input
        sxo syo output    # select control output
    signals
        z1 z2 z3
        sxi_bar syi_bar
        x_gre_y y_gre_x   # used to find out which data input is greater
        sxo_e syo_e       # contains new select information
    components
        # compute the min value z
        x_sel two_input_AND x sxi z1;
        y_sel two_input_AND y syi z2;
        xy_and two_input_AND x y z3;
        z_comp k_input_OR(3) z1 z2 z3 z;

        sxi_inv inv sxi sxi_bar;
        syi_inv inv syi syi_bar;

        # check if values are not equal
        x_gre_y_comp a_gre_b x y x_gre_y;
        y_gre_x_comp a_gre_b y x y_gre_x;

        # compute new select information
        sxo_e_comp two_input_AND y_gre_x syi_bar sxo_e;
        syo_e_comp two_input_AND x_gre_y sxi_bar syo_e;

        # compute output selects
        sxo_comp two_input_OR sxi sxo_e sxo;
        syo_comp two_input_OR syi syo_e syo;
end

MIN uses information from the input select lines or by comparing the two signals x and 
y to compute the output z and the output select lines. Note that reversing the order of 
signals connected to the ports of the circuit a_gre_b changes its function. 

module k_bit_MIN(k)
    # compute a variable length MIN circuit by iteration of
    # a chainable single bit MIN
    ports
        x[1:k] y[1:k] input
        z[1:k] output
    signals
        sx[0:k] sy[0:k] low
    components
        # Turn off initial select signals
        low_gen const(0) low;
        join [ low sx[0] sy[0] ];

        # Chain MIN circuits together
        for i = 1,k
            bit[i] MIN x[i] y[i] z[i] sx[i-1] sy[i-1] sx[i] sy[i];
end

The chain is initialized by connecting the first set of select inputs to the low signal. The
last set of select outputs is left unconnected. 
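
The behaviour of the chained MIN circuit can be checked with a bit-level model in Python. The sketch below mirrors the gate equations of the MIN module but is not generated by the design tool. It assumes that index 1 of x, y, and z is the most significant bit, so that the chain runs from the most significant bit downward, and the function names min_bit and k_bit_min are illustrative.

```python
def min_bit(x, y, sxi, syi):
    """One MIN stage: returns (z, sxo, syo) for single-bit inputs."""
    z = (x & sxi) | (y & syi) | (x & y)
    x_gre_y = x & (1 - y)
    y_gre_x = y & (1 - x)
    # Select x as the minimum once y is found greater, and vice versa.
    sxo = sxi | (y_gre_x & (1 - syi))
    syo = syi | (x_gre_y & (1 - sxi))
    return z, sxo, syo


def k_bit_min(xs, ys):
    """Chain MIN stages over two equal-length bit lists, MSB first."""
    sx = sy = 0                     # initial select inputs tied low
    zs = []
    for x, y in zip(xs, ys):
        z, sx, sy = min_bit(x, y, sx, sy)
        zs.append(z)
    return zs


if __name__ == "__main__":
    # 0110 (six) against 0101 (five), most significant bit first.
    print(k_bit_min([0, 1, 1, 0], [0, 1, 0, 1]))
```

Until the inputs differ, both selects stay low and each stage outputs x AND y. At the first differing bit the smaller input is selected, and every later stage simply copies that input's bits.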
Appendix D
Error Messages 


Each error message has an associated field which describes its type. The error type is used
to identify groups of messages for special consideration. Upon startup, a print mask and 
a halt mask are read from the simrc file. The print mask specifies error types which are 
printed. The halt mask specifies error types which halt design tool execution. Each message 
that halts execution is automatically printed. 

Following is a list of error masks and their associated groupings. Each mask is given as 
an octal constant. 


000001L    Race condition during a simulation run
000002L    Corrected parsing error
000004L    Warning
000010L    Redefinition of a cable or module
000020L    Reference to undefined cable or module
000040L    Conflicting definitions
000100L    Uncorrectable parsing error
000200L    Module generation halted
000400L    Error in primitive module
001000L    Bad data found
002000L    Error statement executed
004000L    Memory allocation error
010000L    Error external to program
020000L    Inconsistency in program
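
How the two masks interact can be shown with a small Python sketch. It is illustrative only: the constant names and the function handle_error are hypothetical, and only the mask values and the rule that a halting error is always printed come from this appendix.

```python
# A few of the error types listed above, written as octal constants.
RACE_CONDITION = 0o000001
WARNING = 0o000004
UNCORRECTABLE_PARSE = 0o000100
MEMORY_ALLOCATION = 0o004000


def handle_error(error_type, print_mask, halt_mask):
    """Return (printed, halted) for an error of the given type."""
    halted = bool(error_type & halt_mask)
    # An error that halts execution is printed even if the print mask omits it.
    printed = halted or bool(error_type & print_mask)
    return printed, halted


if __name__ == "__main__":
    print_mask = RACE_CONDITION | UNCORRECTABLE_PARSE
    halt_mask = UNCORRECTABLE_PARSE | MEMORY_ALLOCATION
    for error in (RACE_CONDITION, WARNING, MEMORY_ALLOCATION):
        print(oct(error), handle_error(error, print_mask, halt_mask))
```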
Abstract


Hierarchical interconnection networks exploit locality in communication to reduce the number
of links in the networks. In this paper, we propose a class of general hierarchical interconnection

»*. of interfaeenode. the appropriate ala, of d«,«r. based on prfom.^, 

eld-effectiveness n.e»m.es, W, .bow that the app.o.ob e.nsider.bly rednee, not only mtoadna. 

Lie denaity bo, also intradoste, *-* * - ** ““onTl aU, 

,b, networks. W, present a performance •» •»« 

and queueing analyses. 

Th , asymmetric topofogy «f a hierardried network degrade prfo^e, and », 

beeana, of aome heavy traffic Ms. Tte traffic dictation. in bierardneal 
important, but ao to tbere has bee. very IBth andyai. of tb, probl™. e ‘ ' 

ffi,„ib„ion on t-o-tod network, and try to 

other performance and cost-effectiveness measures. In addition, we mvestrg 

a eos.-effective hierardried network by aetting appropriate de.ign ,«-»• *n aaaocto, 

algorithm is developed. 



1 Introduction 


Muiticompn.er -Uh hundreds or thousands o, processor, 

most potential lot the next generation of supercomputers. The 

— — ■*- * »> .ions Jn 

passing organization is preferable for these systems due to the simplicity 
processors (7] [9] [16] [18]. 

For a very large system, a critical problem of the interconnection network is that the number of 
JtLd bJ L prohibits large- To tackle ,h. ptohleno, 

Unhs neeaea . . . reduce the num ber of links, have been proposed 

works, which exploit locality m communication (WINsl I5l Hypemet 

in the literature. Some -,>« «. h,«.onnec,.o. Network, ^J^b-sic 

1101 Hierarchieai Cubit Network (HCN) [8], > dust., stntttue using sh 

,-„n media |20l Hrerarehical Memory Structure (HMS) using crossbar ...tdres |H), 

interconnection me* . M, &« ^ to , be „ made with hierarchicrd 

and a two-level mesh hierarchy schem l]- , most are based 

i v, Cm* fl9l and Cedar [12], Among these networks, most are b 

interconnection networks, such as Cm [19] and C l kave been made for 

on some specific topologies such as hypercube, mesh, bus, etc. hew 
general hierarchical networks. 

TITX7 , ■ rci a dass of hierarcliical networks for message-passing systems. A 

be themselves grouped into clusters, with each cluster linked by a separate 

, , , Austel at level 2 is selected as an interface node to construct tne 

latter case one node from each cluster at ievtu * , 

k „, , - r:> "j:rr,"o 

HINs It was shown that a rLiiN is more . rpi 

ifiocdi., in communication exists, i.e., the HIN gains more pu,fo»a»<. bench, per - corn The 
l hot aho indicated the diradvan.ag.r of HIN., including high traffic den.,., over m.erchr.tm 
laid _ iu.racln.ter Unk, (. degradation in performance, and dnmffi .bed « 

“I: “e capabibty became of .be .ingle interface nod. in each clusUn. Hepheaf.on of — 
links and more sophisticated routing algorithm. - «««“•> “ ““ “ 

links. 

In this paper, we propose a class of general hierarchical interconnection networks for message-
passing systems, which are designed using a new approach. Unlike the HINs in [5], the proposed
hierarchical networks are allowed to select any number of nodes from each cluster as interface
nodes. The optimal number of interface nodes and the optimal cluster size are determined
based on performance and cost-effectiveness measures. We will show that choosing multiple
interface nodes in each cluster using the same number of links not only reduces the same amount
of intercluster traffic density as replication of links does, but also reduces a considerable amount of
intracluster traffic density, so that the intracluster traffic can be balanced. In addition, it enhances
the fault tolerance capability of the networks.

A major problem with a hierarchical network is that the network is usually asymmetric even
if each cluster is symmetric, which results in some heavy traffic links that may become potential
communication bottlenecks. Where would congestion take place and how can it be detected? What
is the relationship between traffic density and other performance and cost-effectiveness measures?
To answer these questions, one must analyze traffic distributions in hierarchical networks, which is
difficult. Therefore, one of the objectives of this work is to analyze the traffic distributions so that
we can gain a better insight into hierarchical networks.

We evaluate the performance of the proposed networks in terms of diameter, average internode
distance, traffic density over links, and queueing delay with contention. We also analyze in detail
how to design a cost-effective hierarchical network by choosing appropriate design parameters. An
associated algorithm is developed.

This paper is organized as follows: Section 2 outlines the construction of the proposed
hierarchical networks. In Section 3, performance and cost-effectiveness measures for the hierarchical
networks are studied. Some examples of the hierarchical networks are analyzed and compared in
Section 4. Section 5 analyzes how to determine the design parameters to construct a cost-effective
network. Finally, the concluding remarks appear in Section 6.


2 Construction of hierarchical networks 

The construction of the proposed hierarchical networks can be described as follows. Let $N$ be
the total number of nodes in a hierarchical network. The $N$ nodes are divided into $K_1$ clusters
of $N/K_1$ nodes each. Each cluster of $N/K_1$ nodes is connected to form a level 1 network. For
convenience of analysis, we assume that $K_i$ evenly divides $K_{i-1}$ at level $i$, with initially $K_0 = N$,
and every cluster at the same level is of the same size. The nodes in every cluster are ordered in
the same way, i.e., the corresponding nodes in different clusters have the same internal address.
Then $I_1$ nodes, $1 \le I_1 \le N/K_1$, from each cluster are selected to act as the interface nodes. To
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, , - tprface nod es from each duster are selected. For example, if the t* 

.» — . - — - - — - ■- “- ta 

nodes There are total of h X K, interface nodes at level 1. 

«... .... -°r ^ 

'T TZTT^K, lldesll Each b o.»^ « 2 “J '= 

< jL « — .. ... — - — .o — . - 2 - “ j 

on. 

— * — ^zrr::rr:rr:;x 

topologies. II". " O' kvt | 3 and so oh. Som. 

— - * — *“ - " “• 

rt;'-7" «* .* '™' 1 » * “ mpk “ iy “ m ” c “ d “‘“t ( Ta 

. i. . biw b „.» b . — * W - * > 0). ** -* ” 

using BH with » = 32 , K, = 4 , /, > 2, and if, = 1. 

The HU. described in [5] are spedal toes of »e«wo,k. 1™' U - ^ b 

« iL- networks « be — — 
ordinary binary Iryperenb, network of sir, Jf is . .wod„«l network w,.b 7, = »/*, • 

where hypercube connection is used at both levels. 

U .be fobowing an.ysis, -1 »— -** “! 

. , . f two leve i networks is relatively sunple and the results can 

1.1“ Jib more le.eh Afso is pointed - * H H >“ *» ^ 

number of levels in the hierarchy. For a two-level network, we assume that $K_2$ is always 1, i.e.,
there is one cluster for each group of interface nodes.

3 Performance and cost-effectiveness measures 

w. now analyse the performance and cos. . d f,e.i.en«s of .he hierarchical networks. » to. 
give some definitions and make some assumptions^ Let 
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N _ the total number of nodes in a hierarchical network; 

$L^{(i)}$ - the total number of links at level $i$;

$L_c^{(i)}$ - the number of links in a cluster at level $i$;

$L$ - the total number of links in a hierarchical network;

$K_1$ - the number of clusters at level 1;

$I_1$ - the number of interface nodes selected from each cluster at level 1;

$Dm^{(i)}$ - the diameter of a cluster at level $i$;

$Dm$ - the diameter of a hierarchical network;

$AD^{(i)}$ - the average internode distance of a cluster at level $i$;

$AD$ - the average internode distance of a hierarchical network;

$TD^{(i)}_{max}$ - the highest traffic density over links in a cluster at level $i$;

$TD_{max}$ - the highest traffic density over links in a hierarchical network;

$\lambda$ - the message generation rate at each node;

_ ,b, ,..o »• .ho l«oI i ^ 

^ link, max 

level; 

$\mu^{(i)}$ - the message processing rate of each link at level $i$;

$W^{(i)}_{max}$ - the longest average delay at level $i$ links;

$W_{max}$ - the longest average delay at links in a hierarchical network;

p - the probability that the source and destination nodes of a message axe in the sam 

. L-. (ro, - .dop, . — -*■ ; 5 ” S H; 

(, _ p) i. Lb- probability that tie .ouree and da.tm.L.on «« » “»“* ' J 

L f p for • 8 iv«. darter rim, .be « Lb. lord!., of commumca... ». « » 

reads an m.erdur.er „.~ 6 = to each node in other duster -Lb ^ 
case of a node sending messages to itself is excluded. 
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Tk, four important performance — - *”* «* ^ 

de.sity over Mr, and pn.ueing delay with «— «- « **- «-• * -» "" «“ 

LLa,, — -ad briehy show the fault tolerance capability of the foo.^tai networks. 


3.1 Diameter 


The diameter of a network is the maximum internode distance between any two nodes. For a
two-level hierarchical network, the diameter is

\[ Dm \le 2Dm^{(1)} + Dm^{(2)}, \quad \text{if } 1 \le I_1 < N/K_1 \]
\[ Dm = Dm^{(1)} + Dm^{(2)}, \quad \text{if } I_1 = N/K_1 \qquad (1) \]

The formula above is derived based on the following facts:

i) $I_1 = 1$: If a cluster at level 1 is constructed by a symmetric network (a network that looks
identical when viewed from any of its vertices) such as hypercube and ring, then $Dm^{(1)}$ is the
distance between the interface node and the node farthest away from it. Considering the two
clusters for which their interface nodes at level 2 are farthest away from each other, we can easily
find that $2Dm^{(1)} + Dm^{(2)}$ is the diameter of the hierarchical network. For an asymmetric network
at level 1 such as a binary tree, $Dm^{(1)}$ may be greater than the distance from the interface
node to any other node in the cluster. In this case, $Dm < 2Dm^{(1)} + Dm^{(2)}$.

ii) $2 \le I_1 < N/K_1$: Usually $Dm < 2Dm^{(1)} + Dm^{(2)}$ because more interface nodes give more
alternative paths between any two nodes. However, for some types of networks, such as a completely
connected network used at level 1, the distance between two non-interface nodes in the two clusters
which are farthest away from each other is always $2Dm^{(1)} + Dm^{(2)}$. From i) and ii), we have the
inequality above.

iii) $I_1 = N/K_1$: In this case, every node is an interface node. A message from a source node in
cluster $i$ to its destination node in cluster $j$, $i \ne j$, can go through a path which does not include
any link in cluster $j$. Thus, the diameter is at most $Dm^{(1)} + Dm^{(2)}$. On the other hand, if cluster
$i$ is farthest away from cluster $j$ and the position of the destination node in cluster $j$ corresponds
to that of the node in cluster $i$ which is farthest away from the source node, the distance between
the source and destination is at least $Dm^{(1)} + Dm^{(2)}$. The equation above is thus proved.
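
The diameter relation of Eq. (1), $Dm \le 2Dm^{(1)} + Dm^{(2)}$ in general and
$Dm = Dm^{(1)} + Dm^{(2)}$ when every node is an interface node, can be checked numerically on a
small example. The sketch below is only an illustration under stated assumptions: it uses
hypercubes at both levels, numbers each node as cluster index times $2^m$ plus internal address,
and joins interface nodes that share an internal address by a level 2 hypercube. The function
names and this numbering are hypothetical.

```python
from collections import deque


def build_two_level_hypercube(M, m, interface):
    """Adjacency lists for 2**M nodes grouped into clusters of 2**m nodes.

    Level 1: each cluster is an m-dimensional hypercube (address bits 0..m-1).
    Level 2: nodes whose internal address is in `interface` are also linked
    across clusters as an (M-m)-dimensional hypercube (bits m..M-1).
    """
    adj = [[] for _ in range(1 << M)]
    for v in range(1 << M):
        position = v & ((1 << m) - 1)
        for bit in range(m):
            adj[v].append(v ^ (1 << bit))
        if position in interface:
            for bit in range(m, M):
                adj[v].append(v ^ (1 << bit))
    return adj


def diameter(adj):
    """Largest shortest-path length, found by a breadth-first search per node."""
    worst = 0
    for source in range(len(adj)):
        dist = [-1] * len(adj)
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if dist[w] < 0:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        worst = max(worst, max(dist))
    return worst


M, m = 5, 3                      # 32 nodes, 4 clusters of 8 nodes
bound = 2 * m + (M - m)          # 2*Dm(1) + Dm(2) for hypercubes
one = diameter(build_two_level_hypercube(M, m, {0}))
two = diameter(build_two_level_hypercube(M, m, {0, 7}))
every = diameter(build_two_level_hypercube(M, m, set(range(1 << m))))
print(one, two, every, bound)
```

With a single interface node the bound is met exactly; with every node acting as an interface
node the network is an ordinary hypercube and the diameter drops to $Dm^{(1)} + Dm^{(2)}$.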





3.2 Average internode distance

Like diameter, average internode distance is a fundamental property of a topology. Average
internode distance is the expected number of link traversals a "typical" message needs to reach its
destination. It is a better indicator of message delay than the diameter [17]. Average internode
distance depends on the message distribution, which describes the probability of message exchanges
among different nodes.

A hierarchical network is usually not symmetric even if the networks used to construct clusters
at each level are symmetric. As a result, the average internode distance from different source nodes
to all other nodes can be different. For example, the average internode distance from an interface
node to all other nodes will be shorter than that from a non-interface node. However, if we take
the average of the average internode distances over all nodes, the average internode distance of a
two-level hierarchical network can be computed as follows:

\[ AD \le p \cdot AD^{(1)} + (1 - p)(2AD^{(1)} + AD^{(2)}), \quad \text{if } 1 \le I_1 < N/K_1 \]
\[ AD = p \cdot AD^{(1)} + (1 - p)(AD^{(1)} + AD^{(2)}), \quad \text{if } I_1 = N/K_1 \qquad (2) \]

The derivation of the formula is similar to that of the diameter.
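
As a numerical illustration of Eq. (2), the sketch below evaluates the right-hand side for given
values of $p$, $AD^{(1)}$, and $AD^{(2)}$. The function names are hypothetical, and the hypercube
value $d/2$ used in the example is the usual approximation of the average internode distance of a
$d$-dimensional hypercube, adopted here as an assumption.

```python
def average_distance_bound(p, ad1, ad2, every_node_is_interface=False):
    """Right-hand side of Eq. (2).

    p   : probability that source and destination are in the same cluster
    ad1 : average internode distance of a level 1 cluster
    ad2 : average internode distance of a level 2 cluster
    The level 1 cluster is crossed twice by an intercluster message,
    unless every node is an interface node, in which case it is crossed once.
    """
    level1_hops = ad1 if every_node_is_interface else 2 * ad1
    return p * ad1 + (1 - p) * (level1_hops + ad2)


def hypercube_ad(d):
    """Approximate average internode distance of a d-cube (assumption: d/2)."""
    return d / 2


# Example: hypercube clusters of 32 nodes (m = 5), 32 clusters (M - m = 5), p = 0.5.
p, m, M = 0.5, 5, 10
single = average_distance_bound(p, hypercube_ad(m), hypercube_ad(M - m))
full = average_distance_bound(p, hypercube_ad(m), hypercube_ad(M - m),
                              every_node_is_interface=True)
print(single, full)
```

The first value is the upper bound for the case $1 \le I_1 < N/K_1$; the second is the exact value
for $I_1 = N/K_1$ under the same assumptions.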


3.3 Traffic density 

Traffic density over links is another important performance measure which reflects link
utilization. The analysis of traffic density is important, especially for asymmetric networks, because
this measure can indicate potential communication bottlenecks. So, it may be a better
measure than diameter or average internode distance for asymmetric networks. Low traffic density
is preferable. Traffic density is measured in terms of the average number of messages per link per
unit time, given that each node issues one random message per unit time (here "random" means
that the destination distribution of the messages is uniform).

Since a hierarchical network may be asymmetric, the traffic density over each link in a cluster
can vary. Also, the traffic density over a link at level 1 may be different from that over a link at
level 2. Here the analysis is concentrated on the links with the highest traffic density, because they
are potentially the bottlenecks and they determine the worst case in communication delay. For
simplicity, it is assumed that the networks used to construct clusters at each level are symmetric.
Note that traffic density is related to the traffic distribution pattern and the routing algorithm





employed for the network. 


The highest traffic density over the links at level 1: It is easy to see that the links directly
connecting interface nodes would be the links with the highest traffic density (from now on, we
only consider the traffic density over these links). The traffic density over these links, $TD^{(1)}_{max}$,
consists of three parts:

$TD^{(1)}_{local}$ - Traffic density generated by intracluster communications;

$TD^{(1)}_{out}$ - Traffic density generated by outgoing messages to other clusters;

$TD^{(1)}_{in}$ - Traffic density generated by incoming messages from other clusters.

Each node generates a random message per unit time, and $TD^{(1)}_{local}$ can be computed using $AD^{(1)}$:

\[ TD^{(1)}_{local} = \frac{pN}{K_1} \times \frac{AD^{(1)}}{L_c^{(1)}} \qquad (3) \]

which means that the total number of link traversals made by all the $N/K_1$ nodes in a cluster is
shared by the $L_c^{(1)}$ links in the cluster. Thus, $TD^{(1)}_{local}$ over every link in the cluster is the same.

To calculate the traffic density generated by intercluster messages, it is necessary to specify a
routing strategy first. A natural and simple routing strategy is to divide a cluster into $I_1$ disjoint
subclusters of equal size (e.g., divide a cube into subcubes), each containing an interface node.
When a node wants to send a message to another cluster, it first sends the message to the interface
node in its subcluster. In this way, the interface node in a subcluster is responsible for sending
the messages generated by all the nodes in the subcluster. For the incoming messages from other
clusters, the interface node has to forward them to all the nodes in the whole cluster.
Since each cluster is symmetric, the amount of traffic through each interface node is the same.

Based on this routing strategy, the $TD^{(1)}_{out}$ over a link connecting an interface node is

\[ TD^{(1)}_{out,l} = (1 - p)\left(\frac{N}{K_1 I_1} - 1\right) q_l, \quad 1 \le I_1 < N/K_1 \qquad (4) \]

where $l$ represents the $l$-th link connecting the interface node and $q_l$ is a fraction that gives the
percentage of outgoing messages through the $l$-th link over all the outgoing messages generated by
the subcluster. The range of the values of $l$ depends on the network topology, and the value of $q_l$
depends on both network topology and routing algorithm.

rir“l:«P Sb„. lb. ou.gotog lr.Sc, - — - . - - “ 
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,, , , - interface node and no outgoing message 

when / a = NIK U TD% = 0 because every node .s 

needs to go through the intracluster links. 
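
The outgoing traffic density of Eq. (4) can be evaluated with a few lines of code. The sketch
below assumes the reading $TD^{(1)}_{out,l} = (1-p)(N/(K_1 I_1) - 1)\,q_l$, with
$TD^{(1)}_{out,l} = 0$ when every node is an interface node. The function name and the example
value of $q_l$ (an even split over three links) are illustrative assumptions.

```python
def td_out(p, cluster_size, num_interface, q_l):
    """Outgoing traffic density on the l-th link of an interface node.

    p             : probability that a message stays inside its cluster
    cluster_size  : N / K1, the number of nodes in a level 1 cluster
    num_interface : I1, the number of interface nodes per cluster
    q_l           : fraction of the subcluster's outgoing messages that
                    use the l-th link of the interface node
    """
    if num_interface == cluster_size:
        # Every node is an interface node: no intracluster hop is needed.
        return 0.0
    senders = cluster_size / num_interface - 1   # subcluster minus the interface node
    return (1 - p) * senders * q_l


# Example: 16-node clusters, 2 interface nodes, outgoing traffic split evenly
# over the 3 links of an 8-node subcube (q_l = 1/3), and p = 0.5.
example = td_out(0.5, 16, 2, 1 / 3)
print(example)
```

Doubling the number of interface nodes roughly halves the number of senders that each interface
node serves, which is why this term shrinks as $I_1$ grows.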

The TD$ over the 1 th link connecting an interface node is 




( 5 ) 


td£1 , = (i - p)(rt - x ) x 


AD (l) . 


if h = 


incoming messages via the m TD W above is derived as follows: 

is just for a subduster, while * is for the whole duster. The TD. n , 

., t tk , incoming <» ^ <» ^ tta 

, ‘ T , j or W i node, in otic cluster. tl« p.ob,bffiV of ~dc centog • »« t ,gc 

pm urn. time to .ho nodes in .hi, d«s.« («»p. .be in.rf.ee node) .. 0 - ri<x, M * 

Thus, the over the l 4 ' 1 link is - (1 p)(k, ^ 

... w , ea o < J, c W/ifr, the average number of incoming messages via an interface node, say 
n) When 2 < ii < «/ interface nodes of this duster may also 

A ’ * - * H ° WeVer ’. the fr Ldilfereut interface nodes to all other nodes 

go .buongb.be Unh. connee.mg ..d, .... «* ^ ^ md roulb g 

overlap. Calculation of the extra incoming X)®' wlrich is the 

worst case of I\ = 1- . 

Hi) When li = N/K u every node receives the same amount of incoming messages. 

and L ^ can be used to calculate TD inV 

To find out the highest traffic density over the links at level 1, we must consider the possibility
that the largest $TD^{(1)}_{out,l}$ and the largest $TD^{(1)}_{in,l}$ are not on the same link. We have

\[ TD^{(1)}_{max} = TD^{(1)}_{local} + \max_l (TD^{(1)}_{out,l} + TD^{(1)}_{in,l}). \qquad (6) \]


1 X, a nfTD^ is the TZ>mL f° r A = N/Ku i- e *> at ^ at 

algorithm can do. 

The highest traffic density over the links at level 2: The message distribution on the
links at level 2 is uniform because each level 2 network is symmetric. Each node
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send? [1 r ^ L messages per 


unit time and there are 


total of K\ nodes in each cluster. Thus, we have 


\[ TD^{(2)}_{max} = \frac{(1 - p)N}{I_1} \times \frac{AD^{(2)}}{L_c^{(2)}}. \qquad (7) \]

The highest traffic density in a hierarchical network: Considering all links at both levels,
we have the highest traffic density in a two-level hierarchical network,

\[ TD_{max} = \max(TD^{(1)}_{max}, TD^{(2)}_{max}). \qquad (8) \]

From Eqs. (4), (5), and (7), we can see that increasing the value of $I_1$ reduces both $TD^{(1)}_{max}$ and
$TD^{(2)}_{max}$. Therefore, it is better than replication of the level 2 links [5], which reduces only $TD^{(2)}_{max}$.

3.4 Queueing analysis with contention 

Queueing analysis with contention is a popular method of performance evaluation. Two
queueing models (M/M/1 and M/M/r) are used, based on the following assumptions:

1) Packet-switching transmission is used. A message may consist of many packets. We assume
that a single packet can be transferred between two nodes in unit time.

2) Each node generates messages independent of other nodes, at rate $\lambda$, and the intermessage
times are distributed exponentially.

3) Each link is bidirectional. It delivers a message in either direction at a time, but messages in
both directions are considered as the traffic over the link. Thus, each link is modeled
as a queueing center.

4) Each level $i$ link processes the messages at rate $\mu^{(i)}$ and the message service times are also
distributed exponentially.

5) There is infinite buffer capacity, i.e., no message is dropped due to a full buffer. This
assumption has been adopted by many previous analyses of networks [5] [15], because it makes the
derivation of closed-form expressions easy and still provides a reasonable approximation of the
system, especially for the "order of magnitude" evaluation of the relative performance.
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6) The routing strategy is the same as we 


used for the analysis of traffic density, i.e., when a node needs to send a message to another
cluster, it first sends the message to the interface node in its subcluster.

The longest delay at level 1: Each level 1 link is modeled as an M/M/1 queueing center.
It is easy to see that the $TD^{(1)}_{max}$ derived before can be directly used to derive the message arrival
rate at the links which cause the longest delay at level 1. (The difference is that for traffic density
we assume that on the average each node issues one message per unit time; now we assume
that each node issues $\lambda$ messages per unit time.) Thus, we have

\[ \lambda^{(1)}_{link,max} = \lambda \cdot TD^{(1)}_{max}. \]

The longest delay at level 1 links is

\[ W^{(1)}_{max} = \frac{1}{\mu^{(1)} - \lambda^{(1)}_{link,max}} = \frac{1}{\mu^{(1)} - \lambda \cdot TD^{(1)}_{max}}. \qquad (9) \]


The longest delay at level 2: Each level 2 link is also modeled as an M/M/1 queueing center.
Using $TD^{(2)}_{max}$ to compute the message arrival rate at level 2 links, we have the following result:

\[ W^{(2)}_{max,mul} = \frac{1}{\mu^{(2)} - \lambda^{(2)}_{link,max}} = \frac{1}{\mu^{(2)} - \lambda \cdot TD^{(2)}_{max}}, \qquad (10) \]

where the subscript "mul" is for the approach of choosing multiple interface nodes (and
"rep" for replicating level 2 links).

Pot coruparisorr, - - — •»= — ** * k ' ■*“* "* " 

B ,ch group o, r-replicated c» be modeled <f«uc.» S cent • TD,., 

. . *. but we should always let - 1 wnen we ^ f 

be used to compute A iinfc ax , r imIt model [21 we have 

, .. T n P) a[ld ). Using the formulas for M/M/ r model FJ, we n 

the same for computing TD^ax ana Wmax) 6 


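\[ \lambda^{(2)}_{link,max} = \lambda \cdot TD^{(2)}_{max}, \]

\[ u = \frac{\lambda^{(2)}_{link,max}}{\mu^{(2)}} = \frac{\lambda \cdot TD^{(2)}_{max}}{\mu^{(2)}}, \]

\[ \rho = \frac{u}{r} = \frac{\lambda^{(2)}_{link,max}}{r \mu^{(2)}}, \]

\[ C(r,u) = \frac{u^r}{u^r + r!\,(1 - \rho) \sum_{k=0}^{r-1} u^k / k!}, \]

\[ W^{(2)}_{max,rep} = \frac{1}{\mu^{(2)}} \left[ \frac{C(r,u)}{r(1 - \rho)} + 1 \right], \qquad (11) \]

where $u$ is the traffic intensity and $\rho$ is the server utilization.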

C/ewneim L. general, fo, the same .ndhc, sn M,M/r ^ ^ 

- — — - ' m/m/i -n z zji 

However, since replicating level 2 links cannot reduce $TD^{(1)}_{max}$, which tends to be the
bottleneck of a network, this approach may result in a longer delay in a network than
choosing multiple interface nodes.
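
The two delay models can be compared numerically. The sketch below uses the standard M/M/1
response time $1/(\mu - \lambda)$ and the standard M/M/r response time based on Erlang's C
formula; the function names are hypothetical, and the rates in the example are arbitrary
illustrative values rather than values from the analysis.

```python
from math import factorial


def mm1_delay(arrival_rate, service_rate):
    """Average time in an M/M/1 queueing center (waiting plus service)."""
    if arrival_rate >= service_rate:
        raise ValueError("link is saturated")
    return 1.0 / (service_rate - arrival_rate)


def erlang_c(r, u):
    """Erlang's C formula: probability that an arrival must wait
    in an M/M/r center with traffic intensity u = lambda / mu."""
    rho = u / r
    busy_term = u ** r / factorial(r)
    idle_terms = (1 - rho) * sum(u ** k / factorial(k) for k in range(r))
    return busy_term / (busy_term + idle_terms)


def mmr_delay(arrival_rate, service_rate, r):
    """Average time in an M/M/r queueing center (r replicated links)."""
    u = arrival_rate / service_rate
    rho = u / r
    if rho >= 1:
        raise ValueError("link group is saturated")
    return (erlang_c(r, u) / (r * (1 - rho)) + 1) / service_rate


# One link versus a group of two replicated links, each link with mu = 1.0.
single = mm1_delay(0.5, 1.0)
pair = mmr_delay(1.0, 1.0, 2)     # twice the traffic shared by two links
print(single, pair)
```

With $r = 1$ the M/M/r expression reduces to the M/M/1 expression, which is a convenient sanity
check on the formula.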

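The longest delay in a network: The longest delay on the links in a network is the maximum
of $W^{(1)}_{max}$ and $W^{(2)}_{max}$, that is,

\[ W_{max} = \max(W^{(1)}_{max}, W^{(2)}_{max,mul}), \quad \text{for multiple interface nodes} \]
\[ W_{max} = \max(W^{(1)}_{max}, W^{(2)}_{max,rep}), \quad \text{for replicated level 2 links} \qquad (12) \]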


3.5 Cost-effectiveness measures 

Like the analyses in [5] and other articles, the cost of a hierarchical network is measured by the
total number of links used, because one of our goals is to reduce the link cost of a network. The
total number of links in a two-level hierarchical network is

\[ L = L^{(1)} + L^{(2)} = K_1 L_c^{(1)} + I_1 L_c^{(2)}, \qquad (13) \]

where $L_c^{(i)}$ is the number of links in each cluster at level $i$ and $L^{(i)}$ is the total number of links at
level $i$. Note that replication of links at level 2, say r-replicated, leads to the same number of links
as choosing $I_1 = r$.

In general, trying to minimize $AD$, $TD_{max}$, or $W_{max}$ results in an increase in $L$, and vice versa.
Thus, we adopt the products of $L$ and $AD$, $L$ and $TD_{max}$, and $L$ and $W_{max}$ as cost-effectiveness
measures. Which one is more critical depends on applications and design considerations for a
network. We will use all of these measures in the following analysis. A smaller value of $L \times AD$,
$L \times TD_{max}$, or $L \times W_{max}$ is better.
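
The link count of Eq. (13), read as $L = K_1 L_c^{(1)} + I_1 L_c^{(2)}$, and a cost-effectiveness
product can be sketched as follows. The example assumes hypercubes at both levels and uses the
fact that a $d$-dimensional hypercube has $d \cdot 2^{d-1}$ links; the function names and the
example sizes are illustrative assumptions.

```python
def hypercube_links(d):
    """Number of links in a d-dimensional binary hypercube: d * 2**(d-1)."""
    return d * (1 << d) // 2


def total_links(num_clusters, links_per_l1_cluster, num_interface, links_per_l2_cluster):
    """Eq. (13): L = K1 * Lc(1) + I1 * Lc(2)."""
    return num_clusters * links_per_l1_cluster + num_interface * links_per_l2_cluster


def cost_effectiveness(links, measure):
    """Product of link cost and a performance measure; smaller is better."""
    return links * measure


# Example: N = 32 nodes, 4 clusters of 8 nodes (3-cubes), level 2 clusters are 2-cubes.
M, m = 5, 3
K1 = 1 << (M - m)
for I1 in (1, 2, 8):
    L = total_links(K1, hypercube_links(m), I1, hypercube_links(M - m))
    print(I1, L)
print("ordinary hypercube:", hypercube_links(M))
```

When every node is an interface node ($I_1 = 8$ here), the link count equals that of the ordinary
5-dimensional hypercube, as expected.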


3.6 Fault tolerance capability 

A critical problem in interconnection networks of large size is fault tolerance. Since hierarchical
networks are mainly for large systems, the fault tolerance capability of the networks must be
considered. A common criterion used to measure the fault tolerance capability of interconnection
networks is full access, the capability of a network that provides a connection from any of its
input sources to any of its output destinations. Under the criterion of full access, a network is
assumed to be faulty if there is any source-destination pair that cannot be connected because of
faulty components in the network. A network is said to be fault-tolerant if it can provide a
connection for any source-destination pair in the presence of faults in the
network. The basic idea for fault tolerance is to provide multiple paths for a source-destination
pair so that alternate paths could be used in case of faults in a path. A faulty component can
be a node or a link. Since a node-fault is usually more severe than a link-fault, we consider only
node-faults here.

In a hierarchical network, the interface nodes are critical because they are the "bridges"
connecting clusters. If there is only one interface node in each cluster, a failure of any interface node
will disconnect the nodes in its cluster from all the nodes in other clusters. So, this kind of network
cannot tolerate any node-fault, and is obviously not appropriate for large systems. By choosing
$I_1$ interface nodes in each cluster, we can construct a network which may tolerate multiple (up to
$I_1 - 1$) node faults, depending on its reconfiguration rule and routing algorithm. This means that
a network constructed using our approach will be more reliable than that in [5]. Considering that
the probability of multiple faults within a cluster is much smaller than that of a single fault, we
can easily find that the reliability of a network is enhanced very quickly as $I_1$ increases.
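
The full-access criterion can be tested on a small example by removing nodes and checking
whether the surviving nodes are still mutually reachable. The sketch below is an illustration
under assumptions: hypercubes at both levels, node number equal to cluster index times $2^m$
plus internal address, and interface nodes with the same internal address joined at level 2.
The function names are hypothetical, and rerouting is idealized as "any surviving path may be
used".

```python
from collections import deque


def neighbors(v, M, m, interface):
    """Neighbors of node v in a two-level hypercube/hypercube network."""
    position = v & ((1 << m) - 1)
    dims = range(M) if position in interface else range(m)
    return [v ^ (1 << bit) for bit in dims]


def has_full_access(M, m, interface, failed):
    """True if every pair of surviving nodes is still connected."""
    alive = [v for v in range(1 << M) if v not in failed]
    seen = {alive[0]}
    queue = deque([alive[0]])
    while queue:
        u = queue.popleft()
        for w in neighbors(u, M, m, interface):
            if w not in failed and w not in seen:
                seen.add(w)
                queue.append(w)
    return len(seen) == len(alive)


M, m = 5, 3   # 4 clusters of 8 nodes; nodes 0..7 form cluster 0
print(has_full_access(M, m, {0}, failed={0}))        # single interface node fails
print(has_full_access(M, m, {0, 7}, failed={0}))     # one of two interface nodes fails
print(has_full_access(M, m, {0, 7}, failed={0, 7}))  # both interface nodes fail
```

In this example one interface node per cluster cannot survive the loss of that node, while two
interface nodes per cluster tolerate one such fault but not two, in line with the bound of
$I_1 - 1$ node faults.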

4 Case studies of hierarchical networks 

In this section, we analyze and compare some examples of hierarchical networks, based on
the measures given in the last section, to find out how different structures (topologies) affect
performance and cost-effectiveness of the hierarchical networks. The networks we choose are Binary
Hypercube/Binary Hypercube (BH/BH) and Complete Connection/Binary Hypercube (CC/BH).
For comparison, the ordinary binary hypercube (BH) is used as a reference network.


4.1 BH/BH networks 

Let $N = 2^M$ be the total number of nodes in a BH/BH and $N/K_1 = 2^m$ be the number of nodes in
each cluster at level 1. So, there are $K_1 = 2^{M-m}$ clusters at level 1. $I_1$ interface nodes are selected
from each cluster, where $1 \le I_1 \le 2^m$ and $I_1$ is assumed to be a power of 2. Since we consider
two-level networks, $K_2$ is assumed to be 1. When $I_1 > 1$, we divide each cluster into $I_1$ subcubes
and each subcube contains an interface node. Based on these assumptions, we have the following
results:


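1. The number of links:

\[ L^{(1)} = K_1 L_c^{(1)} = m 2^{M-1}, \]
\[ L^{(2)} = I_1 L_c^{(2)} = I_1 (M - m) 2^{M-m-1}, \]
\[ L_{BH/BH} = L^{(1)} + L^{(2)} = m 2^{M-1} + I_1 (M - m) 2^{M-m-1}. \]

2. Diameter:

\[ Dm_{BH/BH} = (m - \log I_1) + (M - m) + m = M + m - \log I_1. \]

There are two special cases:

i) If $I_1 = 1$, $Dm_{BH/BH} = M + m$, which is the same as that of the BH/BH networks given in [5].

ii) If $I_1 = 2^m$, $Dm_{BH/BH} = M$, which is the diameter of ordinary binary hypercube (BH)
networks.

3. Average internode distance:

\[ AD_{BH/BH} \approx \frac{pm}{2} + (1 - p)\left(\frac{m - \log I_1}{2} + \frac{M - m}{2} + \frac{m}{2}\right)
   = \frac{m}{2} + \frac{(1 - p)(M - \log I_1)}{2}. \]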

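4. Traffic density: We first consider the highest traffic density over the links at level 1, that is,

\[ TD^{(1)}_{max,BH/BH} = TD^{(1)}_{local} + \max_l (TD^{(1)}_{out,l} + TD^{(1)}_{in,l}). \]

Since each cluster is symmetric, we have

\[ TD^{(1)}_{local} = \frac{pN}{K_1} \times \frac{AD^{(1)}}{L_c^{(1)}}
   = p 2^m \times \frac{m 2^{m-1}}{2^m - 1} \times \frac{1}{m 2^{m-1}}
   = \frac{p 2^m}{2^m - 1} \approx p. \]

To compute $TD^{(1)}_{out,l}$ and $TD^{(1)}_{in,l}$, we need to specify routing algorithms. Here we consider two
different routing algorithms for comparison: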
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.. , whicll evenly distributes the incoming (outgoing) messages over all 

1) The routing algorithm w men eveiu y , , 

in .hariug the outgoing traffic »d » Unit involved hr ,h»r„ 6 «h. mcomrug 
Thus, qi = l/(m - log A) and $ = l/ m > 311(1 


7 1 = < 

-*■ ^aut.evtfn 4 


f (i-p)(£- l )=fer : ^ /l<2 ’ 

0; ^ = 2m 


n fl-pl(2~-l) \ ^ TnW < ; 1 < /] < 2' 

max(l - P, 1 J ^tn,<rvcn - m 


rn^ =l-p; 

-*• -^tn.even * 


h = 2' 


, ^T-nW - maxfTil- i) and it should 

Note that TD^ cannot be less than 1 - p because now TD in ^ - tn ,,l 

not be less than that for 7, = 2-. Then, for the routing algorithm, we have 


TD& L «*« = + TDin^tn - 


(1) 




Two special cases are

i) $TD^{(1)}_{max,even} = p + 2(1 - p)\frac{2^m - 1}{m}$ if $I_1 = 1$, and

ii) $TD^{(1)}_{max,even} = 1$ if $I_1 = 2^m$, which is equal to that for BH.

2) A fixed routing algorithm, which is the simplest and most common routing algorithm used
for hypercubes. The routing code is computed as the bitwise Exclusive-OR of the source
and destination addresses. The routing code is scanned from the most significant bit to the least
significant bit. By tracing the routing paths, we find that more than a
half of the total outgoing messages from a subcluster ($2^{m-1}/I_1$ out of $2^m/I_1 - 1$) go through the link
connecting the interface node and the node whose address differs only in the least significant bit
(e.g., the link between nodes 0000 and 0001 in Fig. 2 (a)). The next link (0000 - 0010) shares
almost a quarter of the total outgoing traffic ($2^{m-2}/I_1$ out of $2^m/I_1 - 1$), and so on. Thus, if we define
the $l$-th link as the link that connects two nodes whose addresses differ only in the $l$-th bit position
(starting with the least significant bit as bit 0), we have


T D ^ 

1 U out,l 


(1-p) 


0; 


. i < j x < 2 m , 0 < l < m - 1 


h 
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Moreover, we also find a similar situation for the incoming messages on these links but in reverse 

order. More than a half of the total incoming messages to a cluster ($2^{m-1}$ out of $2^m - 1$) go through
the link connecting the interface node and the node whose address differs only in the most significant
bit (e.g., the link between nodes 0000 and 1000 in Fig. 2 (b)). The next link shares almost a
quarter, and so on. If $I_1$ interface nodes are selected appropriately (e.g., for $2^m = 16$, we select
0000 and 1111 if $I_1 = 2$ (see Fig. 2 (b) and (c)) or 0000, 0101, 1010, and 1111 if $I_1 = 4$), we can

alS ° haVe ' i^<TD^ <(1-P)2‘; 1</, <2 m ,0<J<m-l 


tdZI, = i-p\ 

( 1 ) 


It is obvious that TD^ t rnax and TD in77Ulx 


are not on the same link. By combining TD 


(!) 

ovt,l 


and 


TD^\ we can find that 

iTi.i ’ 


+ (1 - P ) < ma x(TD% + TD^ t ) < (1 - pH 2 *" 1 + l X 1 - ^ < 2 ' 
1 1 ' l 


{ mzx(TD^ t ,t + ~ 1 p; 




h =2 m 


, , tTl ^nO) + rpW) occurs on link 0. Here we have already included the 

which shows that max(i + 1 tn,U (i-p)(I -i) 

incoming traffic via other interface nodes and through link 0, which is approximately " • 

Thus, the highest traffic density over the links at level 1 for fixed routing algonthm rs 


TD { 2 x , /ix = ThI'oL + + TD W 


iW 


Two special cases are

i) $TD^{(1)}_{max,fix} = p + (1 - p)(2^{m-1} + 1)$ if $I_1 = 1$, and

ii) $TD^{(1)}_{max,fix} = 1$ if $I_1 = 2^m$. Note that $TD^{(1)}_{max}$ is always 1 (the lower bound) when $I_1 = 2^m$,
no matter what routing algorithm is used. This is true because at that time, the BH/BH is
a BH and overall traffic within a cluster is perfectly balanced.
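
The uneven link usage of the fixed routing algorithm can be reproduced by tracing routes in a
single cluster. The sketch below assumes one interface node at address 0 of an $m$-dimensional
hypercube and corrects differing address bits from the most significant to the least
significant. The function names are hypothetical; the counts are messages per link when every
other node exchanges one message with the interface node.

```python
from collections import Counter


def fixed_route(src, dst, m):
    """Hops taken when differing bits are corrected from bit m-1 down to bit 0."""
    hops, current = [], src
    for bit in range(m - 1, -1, -1):
        if ((current ^ dst) >> bit) & 1:
            nxt = current ^ (1 << bit)
            hops.append((current, nxt))
            current = nxt
    return hops


def interface_link_loads(m, interface=0):
    """Messages crossing each link of the interface node, indexed by bit position."""
    outgoing, incoming = Counter(), Counter()
    for node in range(1 << m):
        if node == interface:
            continue
        # Outgoing traffic: last hop of the route from the node to the interface node.
        last_from, _ = fixed_route(node, interface, m)[-1]
        outgoing[(last_from ^ interface).bit_length() - 1] += 1
        # Incoming traffic: first hop of the route from the interface node to the node.
        _, first_to = fixed_route(interface, node, m)[0]
        incoming[(first_to ^ interface).bit_length() - 1] += 1
    return dict(outgoing), dict(incoming)


out_load, in_load = interface_link_loads(4)
print(out_load)
print(in_load)
```

For $m = 4$ the link in bit position 0 carries 8 of the 15 outgoing messages and the link in bit
position 3 carries 8 of the 15 incoming messages, which matches the "more than a half" pattern
described for the two directions.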


The traffic density over the links at level 2 can be derived as follows:

\[ TD^{(2)}_{max,BH/BH} = \frac{(1 - p)N}{I_1} \times \frac{AD^{(2)}}{L_c^{(2)}}
   = \frac{1 - p}{I_1} \times \frac{2^M}{2^{M-m} - 1}. \]

When $I_1 = 1$ and $I_1 = 2^m$, $TD^{(2)}_{max,BH/BH} \approx (1 - p)2^m$ and $TD^{(2)}_{max,BH/BH} \approx 1 - p$, respectively.
Finally, we have

\[ TD_{max,BH/BH} = \max(TD^{(1)}_{max,BH/BH}, TD^{(2)}_{max,BH/BH}), \]

where $TD^{(1)}_{max,BH/BH}$ can be either $TD^{(1)}_{max,even}$ or $TD^{(1)}_{max,fix}$, depending on which routing algorithm
is used.

Pig. 5 shores T D ,..„,S„ - * « "" - 7 •“ trSS: 

S M l1t , io the fixed routing algorithm does not result in a significantly higher 

7 n C °ZZ- is because both algorithms can balance the intercluster traflk on the related 

1 .. . ,■ . both intercluster and intracluster traffic), we need to 

links To balance the overall traffic (i.e., botn inierciu 

consider some other routing algorithms which can balance the fink utilization w. c us e 

:::L „ w. h—. — “ d ““ ,r °' compbc 

, . The longest average delay in a BH/BH can be easily computed 

Zn « consider two cases based on two different approaches for cornparrson. «e cho. • 

W ma *,BH/BH' (2) ^ the other re plicates level 2 links (denoted 

multiple interface nodes (denoted as W 

(2) \ 

35 W B HfBH,re P r 

!) The longest delay a. level lr E.rh “ “ M/MA ,U ““ 8 C “‘"' 


W> = \-TD (,) 

■^hnk.max 


W mlx,BHfBH (1) 


max,BHlBH 1 

1 


■v m n 


a) 


2) The longest delay at level 2: 

i) Multiple bl.rf.ce nodes: Each bnh is prodded rb an M/M/1 queuebg center. 


linfc.mai 


w 


( 2 ) 


= ^' TD ^Lx. , BH/BH ’ 

1 


bh/bh, w fi(2 )_A ■Tfll.BH/BH 

», Replicated iinh. (/, = U Each group of rneplicted Ms is modeled as an M/M/ r 


queueing center. 


A • T D max t B H/BH 


u 


r(2) 


C(r, u) 


^-t(i p)e;;o l *v 


7*12 



w 


(2) 

BH/BH, rep 


-f- 

c(2) [r 


C(r,u) 
(1 -P) 


+ 1 


3) Tlie longest delay in a network: 

( W^r / d ir ,) ; if choosing multiple interface nodes 

max(W mal B/i y£jH’ ^ BH/BH, mul) ’ 

m ’ ^ | /rxr(i) tr )*, if replicating level 2 links 

^ max(iy | ^ XtBH/j BH> W BHfBH,rcp ) 1 F 

Note that when replicating level 2 link, tie value. of TOfii. and '»/>» lkOT “ b ' 

.i,U /, = 1. Fig 6 .how. Wmax BH/BH «» «* >™ ■H” 0 -*" ' . 

L, , — the number of replication) »—-* ““ 

“ »d workload, the ,evd . ft*, of the BH/BH network. -* dM *- « "* 

when p < 0-2. Fig. 6 (a) includes the r*«l„ -*r <« *• BH/BH -«* » “ 

have multiple interface node.. Also when I, < 2 (with p = 0.5) and /. = 1 (wfh P - «•»>. ■» «' 

. * wnrlrlnad Therefore Fig- 6 does not show the results for 

these networks saturate under the given workload. Therelore, g 

these cases. 


4.2 CC/BH networks 

Similar to BH/BH networks, let $N = 2^M$ be the total number of nodes in a CC/BH and
$N/K_1 = 2^m$ be the number of nodes in each cluster at level 1. There are $K_1 = 2^{M-m}$ clusters at
level 1. $I_1$ interface nodes are selected from each cluster, where $1 \le I_1 \le 2^m$. When $I_1 > 1$, any
$I_1$ nodes can be selected as interface nodes because each cluster is a clique. Based on these
assumptions, we have the following results:


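1. The number of links:

\[ L^{(1)} = K_1 L_c^{(1)} = 2^{M-1}(2^m - 1), \]
\[ L^{(2)} = I_1 L_c^{(2)} = I_1 (M - m) 2^{M-m-1}, \]
\[ L_{CC/BH} = L^{(1)} + L^{(2)} = 2^{M-1}\left(\frac{I_1 (M - m)}{2^m} + 2^m - 1\right). \]

2. Diameter:

\[ Dm_{CC/BH} = M - m + 2, \quad 1 \le I_1 < 2^m \]
\[ Dm_{CC/BH} = M - m + 1, \quad I_1 = 2^m \]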
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3. Average inter node distance: 

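\[ AD_{CC/BH} \approx p + (1 - p)\left(\frac{M - m}{2} + 2\right), \quad 1 \le I_1 < 2^m \]
\[ AD_{CC/BH} \approx p + (1 - p)\left(\frac{M - m}{2} + 1\right), \quad I_1 = 2^m \]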


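4. Traffic density: The highest traffic density over the links at level 1 is:

\[ TD^{(1)}_{max,CC/BH} = TD^{(1)}_{local} + \max_l (TD^{(1)}_{out,l} + TD^{(1)}_{in,l}), \]

where

\[ TD^{(1)}_{local} = \frac{pN}{K_1} \times \frac{AD^{(1)}}{L_c^{(1)}}
   = p 2^m \times \frac{1}{2^{m-1}(2^m - 1)} = \frac{2p}{2^m - 1}. \]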

The $\max(TD^{(1)}_{out} + TD^{(1)}_{in})$ can be computed as follows:

1) $TD^{(1)}_{out}$: Since each cluster is a clique, a link connecting an [...] total outgoing traffic (i.e., the [...]) [...] nothing (i.e., the traffic generated by a node not in the subcluster). Thus,

$$TD^{(1)}_{out} = \begin{cases} 1 - p; & 1 \le I_1 < 2^m \text{ and link } l \text{ connecting two nodes both in the subcluster} \\ 0; & 1 \le I_1 < 2^m \text{ and link } l \text{ connecting any node not in the subcluster} \\ 0; & I_1 = 2^m \end{cases}$$

2) $TD^{(1)}_{in}$: A link connecting an interface node shares $1/(2^m - 1)$ of the incoming traffic via the interface node. This makes the amount of incoming traffic double on a link connecting two interface nodes. So, we have

$$TD^{(1)}_{in} = \begin{cases} 1 - p; & I_1 = 1 \\ \frac{2(1-p)}{I_1}; & 2 \le I_1 \le 2^m \text{ and link } l \text{ connecting two interface nodes} \\ \frac{1-p}{I_1}; & 2 \le I_1 < 2^m \text{ and link } l \text{ connecting any non-interface node} \end{cases}$$

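3) Since $TD^{(1)}_{out}$ over the link connecting two interface nodes is 0, we can see that $\max(TD^{(1)}_{out} + TD^{(1)}_{in})$ is not on this kind of link unless $I_1 = 2^m$. As a result, we have

$$\max(TD^{(1)}_{out} + TD^{(1)}_{in}) = \begin{cases} (1-p)\left(1 + \frac{1}{I_1}\right); & 1 \le I_1 < 2^m \\ \frac{1-p}{2^{m-1}}; & I_1 = 2^m \end{cases}$$

From the derivation above, we have

$$TD^{(1)}_{max,CC/BH} = \begin{cases} TD^{(1)}_{local} + (1-p)\left(1 + \frac{1}{I_1}\right); & 1 \le I_1 < 2^m \\ TD^{(1)}_{local} + \frac{1-p}{2^{m-1}}; & I_1 = 2^m \end{cases}$$

The traffic density over the links at level 2 is the same as that of BH/BH networks, that is,

$$TD^{(2)}_{max,CC/BH} = \frac{(1-p)\, N \cdot AD^{(2)}}{I_1 L^{(2)}} = \frac{(1-p)\, 2^M}{I_1 (2^{M-m} - 1)}.$$

Finally, we have

$$TD_{max,CC/BH} = \max(TD^{(1)}_{max,CC/BH},\; TD^{(2)}_{max,CC/BH}).$$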

5. The longest average delay: The longest average delay in a CC/BH can be computed as follows:

1) The longest delay at level 1: Each link is modeled as an M/M/1 queueing center.

$$W^{(1)}_{max,CC/BH} = \frac{1}{\mu^{(1)} - \lambda \cdot TD^{(1)}_{max,CC/BH}}$$

2) The longest delay at level 2: For $W^{(2)}_{max,CC/BH}$, we also consider two different cases, i.e., $W^{(2)}_{CC/BH,mul}$ and $W^{(2)}_{CC/BH,rep}$.

i) Multiple interface nodes: Each link is modeled as an M/M/1 queueing center.

$$W^{(2)}_{CC/BH,mul} = \frac{1}{\mu^{(2)} - \lambda \cdot TD^{(2)}_{max,CC/BH}}$$
ii) Replicated links ($I_1 = 1$): Each group of $r$ replicated links is modeled as an M/M/r queueing center.

$$u = \frac{\lambda \cdot TD^{(2)}_{max,CC/BH}}{\mu^{(2)}}, \qquad \rho = \frac{u}{r},$$

$$C(r,u) = \frac{u^r / r!}{u^r / r! + (1-\rho) \sum_{k=0}^{r-1} u^k / k!},$$

$$W^{(2)}_{CC/BH,rep} = \frac{1}{\mu^{(2)}} \left[ \frac{C(r,u)}{r(1-\rho)} + 1 \right].$$

3) The longest delay in a network:

$$W_{max,CC/BH} = \begin{cases} \max(W^{(1)}_{max,CC/BH},\; W^{(2)}_{CC/BH,mul}); & \text{if choosing multiple interface nodes} \\ \max(W^{(1)}_{max,CC/BH},\; W^{(2)}_{CC/BH,rep}); & \text{if replicating level 2 links} \end{cases}$$
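The M/M/1 delays at the two levels and their maximum can be evaluated with a few lines of Python. This is a minimal sketch: the traffic densities are passed in as plain numbers (the values below are hypothetical), the function names are illustrative, and a saturated link (arrival rate not below the service rate) is reported as an infinite delay.

```python
import math


def mm1_delay(lam: float, mu: float, traffic_density: float) -> float:
    """Average delay 1 / (mu - lam * TD) of a link modeled as an M/M/1 queue."""
    arrival_rate = lam * traffic_density
    if arrival_rate >= mu:
        return math.inf  # the link saturates under this workload
    return 1.0 / (mu - arrival_rate)


def longest_delay(lam: float, mu1: float, mu2: float,
                  td_level1: float, td_level2: float) -> float:
    """Longest delay in the network: the larger of the level 1 and level 2 delays."""
    return max(mm1_delay(lam, mu1, td_level1), mm1_delay(lam, mu2, td_level2))


if __name__ == "__main__":
    # lam = 1 and mu = 3 as in the figures; the TD values are made up.
    print(longest_delay(1.0, 3.0, 3.0, td_level1=1.0, td_level2=2.0))
    print(longest_delay(1.0, 3.0, 3.0, td_level1=1.0, td_level2=4.0))
```

The second call returns an infinite delay because the level 2 traffic density exceeds $\mu^{(2)}/\lambda$, which is the saturation case excluded from the plotted results.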

$W_{max,CC/BH}$ is shown in Fig. 6 for $p = 0.5$ and $p = 0.8$, respectively. Note that for given parameter values (such as workload), some of the CC/BH networks saturate when $I_1$ is too small ($I_1 \le 2$ [...]). Therefore, Fig. 6 gives the results only for the CC/BH networks which have the appropriate number of interface nodes or replication of links.

4.3 Comparison and analysis 

The purpose of this section is to see how different structures of hierarchical networks (e.g., the same topology or different topologies at each level) and different design parameters affect their performance and cost-effectiveness. The influence of setting design parameters will be discussed further in the next section.

Figs. 3-9 show $Dm$, $AD$, $TD_{max}$, $W_{max}$, $L \times AD$, $L \times TD_{max}$, and $L \times W_{max}$, respectively, for some examples of BH/BH, BH, and CC/BH with respect to $I_1$ (the number of interface nodes or the number of replications of level 2 links). The size of the networks is 1024 and the size of each cluster is 16, with $p$ = 0.5 and 0.8, respectively. It is assumed that $\lambda = 1$ and $\mu^{(1)} = \mu^{(2)} = 3$. Note that a BH is equivalent to a BH/BH with $I_1 = 16$. From these examples, we can find the following.

1) If $I_1$ is small, the $TD_{max}$ and/or $W_{max}$ stay on level 2 links (the "level" mentioned in Fig. 5 or Fig. 6 means that the value is from $TD^{(l)}_{max}$ or $W^{(l)}_{max}$) [...] in [5] ($I_1$ is always 1). With the increase in $I_1$, the $TD_{max}$ and $W_{max}$ of BH/BH networks at level 2 are reduced faster than those at level 1. For the given CC/BH networks, they stay on the level 2 links all the time, because the degree of connection at level 1 is much higher than that at level 2.

2) The lower bound of $TD_{max}$ in BH/BH is 1, which comes from the lower bound of $TD^{(1)}_{max}$, while that in CC/BH can be lower. A similar thing happens to $W_{max}$. These results indicate that it would be necessary to use a topology with a higher degree of connection to construct the clusters at level 1 if we want to reduce $TD_{max}$ and $W_{max}$ in BH/BH further.

3) For a BH/BH network (actually also for other values of $\lambda$, $\mu^{(l)}$, and $p$), choosing multiple interface nodes always leads to a shorter delay than replicating its level 2 links, because the delay at level 1 is usually the longest delay in the network and replicating level 2 links cannot reduce the highest traffic at level 1.

4) If $I_1$ is large enough, a CC/BH network with multiple interface nodes may also achieve a shorter delay than that with replicated level 2 links, because the latter yields the longer delay at level 1 links. Considering the fact that CC networks have the highest degree of connection among all network topologies but can still result in the longest delay occurring at level 1 if each cluster has a single interface node, we may conclude that, in general, choosing multiple interface nodes is necessary for balancing the traffic over all links in a network and is thus a better design approach than replicating level 2 links.

5) For both BH/BH and CC/BH, a very small value of $I_1$ is not a good choice or may even be impossible because of saturation. Although it leads to the smallest $L \times AD$ (actually just a little bit smaller), it results in much larger $TD_{max}$ and $L \times TD_{max}$ [...]. When $p$ is large, $I_1 = N/K_1$ for BH/BH (= BH) is not good either because of large $L \times AD$, $L \times TD_{max}$, and $L \times W_{max}$.

6) When $I_1$ is large, CC/BH networks lead to smaller $TD_{max}$, $W_{max}$, and $L \times TD_{max}$ (if $p$ is also large) but result in larger $L \times W_{max}$. CC/BH networks also lead to a smaller $Dm$ and $AD$ (about 27% - 42% for $Dm$ and 25% - 40% for $AD$). However, the $L \times AD$ of CC/BH is relatively large. Thus, there is a trade-off between performance and cost-effectiveness in the design of networks. If performance is considered the main issue for a network, the degree of connection of clusters at level 1 may have to be higher than that at level 2.
5 Design of cost-effective hierarchical networks


[...] based hierarchical networks, using the number of links as the [...]. [...] considering more general cases that are not restricted to some spe[...]. [...] average message delay are also used as performance measures so that [...]. The cost-effectiveness measures to be used in the following analysis are thus $L \times AD$, $L \times TD_{max}$, and $L \times W_{max}$.

In order to do the analysis, it is necessary for us to define the problem more precisely. Recall the motivation for hierarchical networks: exploiting the locality of communication to reduce the number of links. Thus, if locality exists, for a given network (reference network), we can construct a corresponding hierarchical network in which the number of links at the higher level is reduced so that the hierarchical network could be more cost-effective than its counterpart. The [...] degree of connection [...] interface [...] network using a topology with higher degree of connection at the lower level while keeping the [...] number of [...]. [...] method can lead to a significant reduction of $TD_{max}$, $W_{max}$, [...] size of clusters at the [...] cost-effective than its [...] be used only for the relatively small [...].

Based on the two methods above, we know that the basic design parameters for a hierarchical network are the topology [...], the number of interface nodes in each cluster, and the size of clusters. The ideal approach is to [...] parameters at the same time, but it is [...] of different topologies in the [...] of interface nodes and the size of clusters, [...] that the topologies at both levels are [...].

Let $L_R$, $AD_R$, $TD_R$, and $W_R$ be the $L$, $AD$, $TD_{max}$, and $W_{max}$ of the given reference network, and let $L_H$, $AD_H$, $TD_H$, and $W_H$ represent those of the hierarchical network.
Problem:

Given $N$, $p$, $\lambda$, $\mu^{(l)}$, and the topologies of a two-level hierarchical network and its reference network, find the size of clusters at level 1 ($N/K_1$) and the number of interface nodes ($I_1$), such that

1) $L_H \times AD_H$, $L_H \times TD_H$, and/or $L_H \times W_H$ are minimized.

2) $\frac{L_H \times AD_H}{L_R \times AD_R} \le 1$, $\frac{L_H \times TD_H}{L_R \times TD_R} \le 1$, and/or $\frac{L_H \times W_H}{L_R \times W_R} \le 1$.


Some explanations are needed for the above definition: 


i) The first condition is to ensure that the resulting values of the two parameters are optimal. The second condition is to ensure that the resulting hierarchical network is more cost-effective than its reference network.

ii) $p$ is usually a function of $N/K_1$, i.e., $p = f(N/K_1)$. Thus, here "given $p$" means that $f(N/K_1)$ is known.

iii) The reference network has the same topology as that of level 1 clusters or level 2 clusters or both. For example, a BH can be a reference network for a CC/BH or a BH/BH.

iv) In general, it is very difficult to minimize all of $L_H \times AD_H$, $L_H \times TD_H$, and $L_H \times W_H$, because they require different values of $I_1$ and $N/K_1$. Based on the requirements of an application, one or two of them are chosen as the major measure, or a trade-off must be made. Similarly, the three inequalities are hard to satisfy at the same time. When only some of the inequalities can be satisfied, we should ensure that the left side of the other inequalities is less than or equal to a small constant, so that we could still have a cost-effective network.

Solving the problem is not straightforward because of the multiple variables in the inequalities and the dependence between them. Also, $p$ depends on $N/K_1$, and the computation of $N/K_1$ needs $p$. Another thing we should mention is that for a hierarchical network, $TD_H = \max(TD^{(1)}_{max}, TD^{(2)}_{max})$ requires computation of the second inequality separately for $TD_H = TD^{(1)}_{max}$ and $TD_H = TD^{(2)}_{max}$ (i.e., we have to assume $TD_H = TD^{(1)}_{max}$ or $TD_H = TD^{(2)}_{max}$ each time), and each resulting pair of $N/K_1$ and $I_1$ must be consistent with the preassigned $TD_H$ (i.e., it must lead to $TD_{max} = TD_H$ in the network). The same thing applies to the computation of the third inequality because of $W_{max} = \max(W^{(1)}_{max}, W^{(2)}_{max})$. For the first inequality ($AD$), we need to consider whether $I_1 = N/K_1$ or not. In the following, we propose an algorithm for the problem.





Algorithm: 

Step 1: For each of the preassigned $AD_H$, $TD_H$, and/or $W_H$, solve

$$\frac{L_H \times AD_H}{L_R \times AD_R} \le 1, \quad \frac{L_H \times TD_H}{L_R \times TD_R} \le 1, \quad \text{and/or} \quad \frac{L_H \times W_H}{L_R \times W_R} \le 1,$$

respectively, to find $N/K_1$ in terms of $N$, $p$, $\lambda$, $\mu^{(l)}$, and $I_1$.

Step 2: By assigning each possible value to $I_1$ and then solving the inequality involving $N/K_1$ and $p = f(N/K_1)$, find all possible pairs of $N/K_1$ and $I_1$ which are valid for the preassigned $AD_H$, $TD_H$, and/or $W_H$.

Step 3: Compute $L_H \times AD_H$, $L_H \times TD_H$, and/or $L_H \times W_H$ for each valid pair of $N/K_1$ and $I_1$. Find the pair which minimizes $L_H \times AD_H$, $L_H \times TD_H$, and/or $L_H \times W_H$.
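The three steps can be mimicked by a small search routine. The sketch below is not the procedure of the text: instead of solving the inequalities for $N/K_1$, it simply enumerates candidate pairs, discards those that violate the ratio condition, and keeps the pair with the smallest product. The names `best_design`, `evaluate`, and `interface_counts` are hypothetical, and the toy cost model in the usage example is made up.

```python
from typing import Callable, Iterable, Optional, Tuple


def best_design(
    cluster_sizes: Iterable[int],
    interface_counts: Callable[[int], Iterable[int]],
    evaluate: Callable[[int, int], Optional[Tuple[float, float]]],
    reference: Tuple[float, float],
) -> Optional[Tuple[float, int, int]]:
    """Return (cost, cluster_size, i1) minimizing L_H * X_H, or None.

    evaluate(cluster_size, i1) returns (L_H, X_H), where X_H is the chosen
    performance measure (AD, TD, or W), or None if the pair is invalid
    (for example, saturated). reference is (L_R, X_R).
    """
    l_ref, x_ref = reference
    best = None
    for size in cluster_sizes:
        for i1 in interface_counts(size):
            result = evaluate(size, i1)
            if result is None:
                continue
            l_h, x_h = result
            cost = l_h * x_h
            if cost > l_ref * x_ref:      # ratio condition (L_H X_H)/(L_R X_R) <= 1
                continue
            if best is None or cost < best[0]:
                best = (cost, size, i1)
    return best


if __name__ == "__main__":
    # Toy model: links grow with i1, the performance measure shrinks with i1.
    toy = lambda size, i1: (100 + 10 * i1, 8.0 / i1)
    print(best_design([16], lambda s: [1, 2, 4, 8, 16], toy, (200, 1.0)))
```

Exhaustive enumeration is only practical here because the number of candidate pairs is small; it returns the best pair among those examined, in the same spirit as the remark that the algorithm gives the best pair from all available pairs.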

For this algorithm, we should note that:

i) It may not find the optimal solution (i.e., the optimal pair of $N/K_1$ and $I_1$) if for some reason (e.g., difficulty of solving an inequality) not all possible valid pairs can be found. However, the algorithm will give the best pair from all available pairs, and it guarantees that any solution (optimal or near optimal, if it exists) will lead to a hierarchical network which is more cost-effective than or equivalent to its reference network.

ii) The algorithm tries to fix variables one by one, i.e., it solves for one variable at a time. This is because we want to avoid dealing with multiple variables at a single step. It also tries to fix $N/K_1$ first because the size of clusters at level 1 is usually small, so that values of $I_1$ can be limited to a small range ($\le N/K_1$).

We now analyze the algorithm in detail: 


Step 1: Find $N/K_1$:

The difficulty of this step is that the inequalities may not be solved easily. However, by simplifying these inequalities, we may solve them directly or numerically. For example, let us consider the case that the hierarchical network is a BH/BH and its reference network is a BH. We choose $L_H \times W_H$ as the main cost-effectiveness measure and assume $TD_H = TD^{(2)}_{max}$ to compute $TD_H$ and $W_H$. From the last section, we know that

$$TD_H \approx \frac{1-p}{I_1} \times \frac{N}{K_1}, \qquad W_H = \frac{1}{\mu^{(2)} - \lambda \cdot TD_H},$$

$$TD_R \approx 1, \qquad W_R = \frac{1}{\mu^{(1)} - \lambda \cdot TD_R},$$

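and for BH/BH versus BH, we always have $L_H \le L_R$. So,

$$\frac{L_H \times W_H}{L_R \times W_R} \le \frac{W_H}{W_R} = \frac{\mu^{(1)} - \lambda \cdot TD_R}{\mu^{(2)} - \lambda \cdot TD_H}.$$

Since [...] (for a stable [...]) [...], that is,

$$\frac{L_H \times W_H}{L_R \times W_R} \le 1 \quad \text{if} \quad \frac{N}{K_1} \le \frac{I_1}{1-p}.$$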

Then, we need to check whether $N/K_1$ in this range is also good for $L_H \times TD_H$ and $L_H \times AD_H$. If $\frac{N}{K_1} \le \frac{I_1}{1-p}$, we have

$$\frac{L_H \times TD_H}{L_R \times TD_R} \le \frac{TD_H}{TD_R} \approx \frac{1-p}{I_1} \times \frac{N}{K_1} \le 1.$$

For $L \times AD$ of BH/BH, we can find that [...] a BH is a special case of BH/BH with the maximum value of $I_1$ (i.e., [...]) hold. Therefore, $N/K_1$ in this range is acceptable, but it still needs to be validated at the next step to be consistent with the preassigned $TD_H$.
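The step from the delay ratio to the bound on $N/K_1$ in this example can be spelled out as follows. This is a sketch under an explicit assumption that is not legible in the text at this point: both levels use the same service rate, $\mu^{(1)} = \mu^{(2)} = \mu$ (the value 3 is used for both in the figures), and neither network is saturated, so both denominators are positive.

```latex
\begin{align*}
\frac{W_H}{W_R} \le 1
  &\iff \mu - \lambda \, TD_H \;\ge\; \mu - \lambda \, TD_R \\
  &\iff TD_H \le TD_R \approx 1 \\
  &\iff \frac{1-p}{I_1} \cdot \frac{N}{K_1} \le 1 \\
  &\iff \frac{N}{K_1} \le \frac{I_1}{1-p}.
\end{align*}
```

For instance, with $p = 0.5$ and $I_1 = 8$ the bound is $N/K_1 \le 16$, so clusters of 16 nodes just satisfy it.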

Step 2: Determine valid pairs of $N/K_1$ and $I_1$:

The number of choices of $I_1$ is at most $N/K_1$. Sometimes we want $I_1$ to be a power of 2, which will produce at most $\log(N/K_1) + 1$ choices. After assigning a possible value to $I_1$, we solve the inequality involving $p$ and $N/K_1$ and obtain a pair of $N/K_1$ and $I_1$. Since $p = f(N/K_1)$, we may need to solve the inequality implicitly or numerically. Then, we check whether the resulting pairs are valid, i.e., whether the pairs are consistent with the preassigned $AD_H$, $TD_H$, or $W_H$. For example, if we use $TD^{(2)}_{max}$ to compute $N/K_1$ at Step 1, then the pairs obtained at Step 2 must lead to $TD_H = TD^{(2)}_{max}$.
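The enumeration of candidate values of $I_1$ is easy to sketch in Python. The helper below lists the power-of-2 choices for a given cluster size and keeps those that satisfy the Step 1 bound $\frac{1-p}{I_1} \times \frac{N}{K_1} \le 1$ from the BH/BH example. It does not perform the further consistency check against the preassigned $TD_H$; the function names and the constant locality function in the usage example are assumptions made for illustration.

```python
from typing import Callable, List


def power_of_two_choices(cluster_size: int) -> List[int]:
    """All powers of 2 not exceeding the cluster size N/K_1."""
    choices = []
    i1 = 1
    while i1 <= cluster_size:
        choices.append(i1)
        i1 *= 2
    return choices


def choices_within_bound(cluster_size: int,
                         locality: Callable[[int], float]) -> List[int]:
    """Values of I_1 with (1 - p) * (N/K_1) / I_1 <= 1, where p = f(N/K_1)."""
    p = locality(cluster_size)
    return [i1 for i1 in power_of_two_choices(cluster_size)
            if (1.0 - p) * cluster_size / i1 <= 1.0]


if __name__ == "__main__":
    print(power_of_two_choices(16))                 # log2(16) + 1 = 5 choices
    print(choices_within_bound(16, lambda size: 0.5))
```

Passing the locality as a function keeps the dependence $p = f(N/K_1)$ explicit, even though the example uses a constant.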


Step 3: Find the optimal pair of $N/K_1$ and $I_1$:

This step is straightforward: Use all valid pairs to compute all possible $L_H \times AD_H$, $L_H \times TD_H$, and/or $L_H \times W_H$ and then choose the pair leading to the minimum. Besides computing these values, we may see the trends of $L_H \times AD_H$, $L_H \times TD_H$, and $L_H \times W_H$ by directly analyzing the related formulas. Here we show how $I_1$ affects $L \times AD$ and $L \times TD_{max}$.

1) From Eq. (13), $L = K_1 L^{(1)} + I_1 L^{(2)}$, we can find that $L^{(2)}$, which is a function of $K_1$, will be quite large because $K_1$ is usually large. Thus, $L$ will increase rapidly as $I_1$ increases. At the same time, however, $AD$ (see Eq. (2)) is reduced only a little, because [...] $AD$. Therefore, if $L \times AD$ is considered as a main cost-effectiveness measure, a small $I_1$ (1 or 2) is preferred.

2) From Eqs. (3), (4), (5), and (7), we can find that $TD^{(1)}_{out}$, $TD^{(1)}_{in}$, and $TD^{(2)}_{max}$ are almost inversely proportional to $I_1$, while $TD^{(1)}_{local}$ is independent of $I_1$. Two situations will occur:

i) When $I_1$ is small or the topology of clusters at level 1 has a high degree of connection, usually $TD_{max} = TD^{(2)}_{max}$. At that time, increasing $I_1$ will definitely reduce $TD_{max}$ and $L \times TD_{max}$, which may also lead to the reduction of $W_{max}$ and $L \times W_{max}$. This can be seen from Eqs. (7) and (13). For example, if $I_1$ is doubled, $TD^{(2)}_{max}$ will be reduced by half, but $L$ will not be doubled because the item $K_1 L^{(1)}$ of $L$ does not increase with $I_1$. Thus, we can increase $I_1$ in this situation.

ii) As $TD^{(2)}_{max}$ decreases, $TD^{(1)}_{max}$ may become $TD_{max}$ [...] changed. After that, increasing $I_1$ [...] decrease for a while but it may eventually increase, if $p$ is large (i.e., $TD_{local}$ is large). In this situation, we can choose the $I_1$ which yields a value close to the turning point.
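The first situation can be checked numerically for a BH/BH network. The following Python sketch assumes that every level is a binary hypercube (so a hypercube with $2^d$ nodes has $d \cdot 2^{d-1}$ links), combines the link counts as $L = K_1 L^{(1)} + I_1 L^{(2)}$, and uses the approximation $TD^{(2)}_{max} \approx \frac{1-p}{I_1} \times \frac{N}{K_1}$. It only covers the case in which the level 2 links carry the highest traffic; the function names are illustrative.

```python
def bh_bh_links(M: int, m: int, i1: int) -> int:
    """Total links L = K_1 * L1 + I_1 * L2 of a two-level BH/BH network."""
    k1 = 2 ** (M - m)                 # number of clusters
    links_level1 = m * 2 ** m // 2    # hypercube inside one cluster
    links_level2 = (M - m) * k1 // 2  # one hypercube over the clusters
    return k1 * links_level1 + i1 * links_level2


def level2_traffic_density(m: int, i1: int, p: float) -> float:
    """Approximate level 2 traffic density (1 - p) * (N/K_1) / I_1."""
    return (1.0 - p) * 2 ** m / i1


def links_times_td(M: int, m: int, i1: int, p: float) -> float:
    """Cost-effectiveness measure L * TD when level 2 is the bottleneck."""
    return bh_bh_links(M, m, i1) * level2_traffic_density(m, i1, p)


if __name__ == "__main__":
    # N = 1024, clusters of 16 nodes, p = 0.5.
    for i1 in (1, 2, 4):
        print(i1, bh_bh_links(10, 4, i1), links_times_td(10, 4, i1, 0.5))
```

Doubling $I_1$ from 1 to 2 halves the level 2 traffic density while the link count only grows from 2240 to 2432, so the product drops from 17920 to 9728.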


6 Concluding remarks 


A class of general hierarchical interconnection networks for message-passing architectures has been presented. The proposed hierarchical networks may have any number of interface nodes in each cluster. It has been found that increasing the number of interface nodes in each cluster is better than replicating links, because the former can considerably reduce intracluster traffic density as well as intercluster traffic density and still use the same number of links as the latter. In addition, it enhances the fault tolerance capability of the networks.

The proposed networks with two levels have been evaluated in terms of the following performance measures: diameter, average internode distance, traffic density over links, and queueing delay with contention. By examining several typical networks, we have shown that different structures could significantly affect their performance and cost-effectiveness. We have also shown how the design of a cost-effective network relies on choosing appropriate design parameters, such as the size of clusters and the number of interface nodes, and how different cost-effectiveness measures require different values of these parameters. An algorithm has been developed for choosing the optimal design parameters.
[...] degrade performance [...] depending on values of design parameters. Therefore, a good approach to the design of an efficient hierarchical network is to ensure the balance of traffic over all levels of the network.
Fig. 1. (a) A two-level CC/BH network with $N = 16$, $K_1 = 4$, $I_1 = 2$, and $K_2 = 1$. (b) A two-level BH/BH network with $N = 32$, $K_1 = 4$, $I_1 = 2$, and $K_2 = 1$. (The links at level 2 are darkened.)




cluster (BH) with 16 nodes.




Fig. 4. Average internode distance ($AD$) versus $I_1$ under $p = 0.5$ and $p = 0.8$, respectively, with $N = 1024$ and $N/K_1 = 16$.

Fig. 5. The highest traffic density ($TD_{max}$) versus $I_1$ under (a) $p = 0.5$ and (b) $p = 0.8$, respectively, with $N = 1024$ and $N/K_1 = 16$. (Horizontal axis: $\log I_1$.)

Fig. 6. The longest queueing delay ($W_{max}$) versus $I_1$ for (a) $p = 0.5$ and (b) $p = 0.8$, respectively, with $N = 1024$, $N/K_1 = 16$, $\lambda = 1$, and $\mu^{(1)} = \mu^{(2)} = 3$.

Fig. 7. $L*AD$ versus $I_1$ under $p = 0.5$ and $p = 0.8$, respectively, with $N = 1024$ and $N/K_1 = 16$.

Fig. 8. $L*TD_{max}$ versus $I_1$ under $p = 0.5$ and $p = 0.8$, respectively, with $N = 1024$ and $N/K_1 = 16$.

Fig. 9. $L*W_{max}$ versus $I_1$ under $p = 0.5$ and $p = 0.8$, respectively, with $N = 1024$, $N/K_1 = 16$, $\lambda = 1$, and $\mu^{(1)} = \mu^{(2)} = 3$.




